Title: Chapter 12: Indexing and Hashing

1
Chapter 12 Indexing and Hashing
  • Basic Index Concepts
  • Ordered Indices
  • B-Tree Index Files
  • Hashing
  • Index Definition in SQL
  • Multiple-Key Access
  • Bitmap Indices

2
Basic Index Concepts
  • Indexes speed up access to data in a table.
  • Analogy: the card catalog in a library (author,
    title, subject).
  • A set of one or more attributes used to look up
    records in a table is referred to as a search
    key.
  • In the simplest case, an index file consists of
    records of the form (search-key value, pointer).
  • Each such record is referred to as an index entry.

3
Basic Index Concepts, Cont.
  • An index file is a file, and suffers from many of
    the same problems as a data file.
  • Index files are typically much smaller than the
    original file, often 10 - 25% of its size.
  • Two kinds of indices, primarily
  • Ordered indices - index entries are stored in
    sorted order, based on the search key.
  • Hash indices - search keys are distributed
    uniformly across buckets using a hash
    function.

4
Index Evaluation Metrics
  • Insertion time (OLTP)
  • Deletion time (OLTP)
  • Space overhead
  • Access time (DSS) for
  • Point Queries - Records with a specified value in
    an attribute.
  • Range Queries - Records with an attribute value
    in a specified range.

5
Ordered Indices
  • An index whose search key specifies the
    sequential order of the data file is a primary
    index.
  • Also called clustering or clustered index.
  • Search key of a primary index is frequently the
    primary key.
  • An index whose search key does not specify the
    sequential order of the data file is a secondary
    index.
  • Also called a non-clustering or non-clustered
    index.
  • An ordered sequential file with a primary index
    is an index-sequential file.

6
Dense Index Files
  • An index that contains an index record for every
    search-key value in the data file is a dense
    index.

7
Dense Index Files, Cont.
  • To locate the record(s) with search-key value K
  • Find index record with search-key value K.
  • Follow pointer from the index record to the data
    record(s).
  • To delete a record
  • Locate the record in the data file, perhaps using
    the above procedure.
  • Delete the record from the data file.
  • If the deleted record was the only one with that
    search-key value, then delete the search-key from
    the index (similar to data record deletion)
  • To insert a record
  • Perform an index lookup using the record's
    search-key value.
  • If the search-key value does not appear in the
    index, insert it.
  • Insert the record into the data file (exactly
    where depends on the file organization).
  • Assign a pointer to the data record from the
    index entry.
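The dense-index operations above can be sketched with an in-memory sorted list of index entries. This is a minimal sketch, not from the slides: record "pointers" are modeled as row offsets, and the `DenseIndex` name and bucket-as-list layout are illustrative assumptions.

```python
from bisect import bisect_left

class DenseIndex:
    """One index entry per search-key value: (key, [record offsets])."""

    def __init__(self):
        self.entries = []  # kept sorted on the search key

    def lookup(self, key):
        i = bisect_left(self.entries, (key,))
        if i < len(self.entries) and self.entries[i][0] == key:
            return self.entries[i][1]          # follow pointers to records
        return []

    def insert(self, key, offset):
        i = bisect_left(self.entries, (key,))
        if i < len(self.entries) and self.entries[i][0] == key:
            self.entries[i][1].append(offset)  # key already in the index
        else:
            self.entries.insert(i, (key, [offset]))  # new index entry

    def delete(self, key, offset):
        i = bisect_left(self.entries, (key,))
        if i < len(self.entries) and self.entries[i][0] == key:
            self.entries[i][1].remove(offset)
            if not self.entries[i][1]:  # last record with this key:
                del self.entries[i]     # delete the index entry too

idx = DenseIndex()
idx.insert("Perryridge", 0)
idx.insert("Mianus", 1)
idx.insert("Perryridge", 2)
print(idx.lookup("Perryridge"))  # [0, 2]
```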

8
Sparse Index Files
  • An index that contains index records but only for
    some search-key values in the data file is a
    sparse index.
  • Typically one index entry for each data file
    block.
  • Only applicable when records are sequentially
    ordered on search-key, i.e., the index is a
    primary index.

9
Sparse Index Files, Cont.
  • To locate a record with search-key value K
  • Find the index record with largest search-key
    value < K.
  • Search file sequentially from the record to which
    the index record points.
  • To delete a record
  • Locate the record in the data file, perhaps using
    the above procedure.
  • Delete the record from the data file.
  • If the deleted record was the only record with
    its search-key value, and if an entry for the
    search key exists in the index, then replace the
    index entry with the next search-key value in the
    data file (in search-key order). If the next
    search-key value already has an index entry, the
    entry is simply deleted.
  • To insert a record
  • Perform an index lookup using the record's
    search-key value; suppose that the index stores
    an entry for each block of the data file.
  • If the index entry points to a block with free
    space, then simply insert the record in that
    block.
  • If the index entry points to a full block, then
    allocate a new block and insert the first
    search-key value appearing in the new block into
    the index
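The sparse-index lookup above can be sketched as follows, assuming an in-memory data file sorted on the search key; the block size of 3 records is an illustrative assumption.

```python
from bisect import bisect_right

# Sorted data file, chopped into "blocks"; one sparse index entry per block.
data = sorted(["Brighton", "Clearview", "Downtown", "Mianus",
               "Perryridge", "Redwood", "Round Hill"])
BLOCK = 3
blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
sparse = [(blk[0], b) for b, blk in enumerate(blocks)]  # (first key, block#)

def locate(key):
    # Find the index entry with the largest search-key value not
    # exceeding the sought key ...
    i = bisect_right([k for k, _ in sparse], key) - 1
    if i < 0:
        return None
    # ... then search that block sequentially.
    for rec in blocks[sparse[i][1]]:
        if rec == key:
            return rec
    return None

print(locate("Mianus"))  # found after a sequential scan of its block
print(locate("Zion"))    # not present -> None
```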

10
Sparse Index Files, Cont.
  • In general, sparse indices
  • Require less space and maintenance for insertions
    and deletions.
  • Are generally slower than dense indices for
    locating records, especially if there is more
    than one block per index entry.

11
Multilevel Index
  • If an index does not fit in memory, access
    becomes expensive.
  • To reduce the number of disk accesses to index
    records, treat it as a sequential file on disk
    and build a sparse index on it.
  • Outer index: a sparse index.
  • Inner index: a sparse or dense index.
  • If the outer index is still too large to fit in
    main memory, yet another level of index can be
    created, and so on.

12
Multilevel Index, Cont.
  • Indices at all levels might require updating upon
    insertion or deletion.
  • Multilevel insertion, deletion and lookup
    algorithms are simple extensions of the
    single-level algorithms.

13
Secondary Indices
  • Frequently a table is searched using a search key
    (i.e., a set of columns) other than the one on
    which the table is sorted.
  • Suppose account is sorted by account number.
  • Searching based on branch, or searching for a
    range of balances.
  • Although the table is not sorted based on the
    search key, an index can nonetheless be
    constructed on the search key.
  • As noted previously, such an index is called a
    secondary index, i.e., an index whose search key
    does not specify the sequential order of the data
    file.

14
Secondary Indices
  • A secondary index will contain an entry for each
    search-key value.
  • Each index entry will point to either a
  • Single record containing the search-key value
    (if the search key is a candidate key).
  • Bucket that contains pointers to all records with
    that search-key value (if the search key is not a
    candidate key).
  • All previous algorithms and data structures can
    be modified to apply to secondary indices.

15
Secondary Index on balance field of account
  • Note that secondary indices have to be dense
    (why?)

16
Primary and Secondary Indices
  • Indices improve performance
  • Primary Indices
  • Point queries
  • Range queries
  • Secondary Indices
  • Point queries
  • Indices can also hurt performance
  • All indices must be updated upon insertion or
    deletion.
  • Scanning a file sequentially in secondary
    search-key order can be expensive.
  • Worst case - each record access may fetch a new
    block from disk
  • Thus, the number of tuples retrieved is an upper
    bound on the number of data blocks retrieved when
    scanning a secondary index for a range query.
  • The same upper bound applies to a primary index,
    however in that case the upper bound is not
    tight (we will return to this in query
    optimization)

17
B+-Tree Index Files
  • B+-tree indices are a type of multi-level index.
  • Disadvantage of traditional sparse and dense
    index files
  • Performance degrades as the index file grows,
    since many overflow blocks get created in the
    index file.
  • Periodic reorganization of entire index file is
    required.
  • Advantage of B-tree index files
  • Automatically reorganizes itself with small,
    local, changes.
  • Index reorganization is not required as
    frequently.

18
Example of a B+-tree
  • Disadvantage of B+-trees - extra insertion,
    deletion and space overhead.
  • Advantages of B+-trees outweigh disadvantages,
    and they are used extensively.

B+-tree for account file (n = 3)
19
B+-Tree Index Files (Cont.)
A B+-tree is a rooted tree satisfying the
following properties:
  • All paths from root to leaf are of the same
    length.
  • Each node that is not a root or leaf must be at
    least half full:
  • Between ⌈n/2⌉ and n children.
  • Between ⌈n/2⌉ - 1 and n - 1 search-key values.
  • A leaf node has between ⌈n/2⌉ - 1 and n - 1
    search-key values.
  • Special cases:
  • If the root is not a leaf, it can have as few as
    2 children.
  • If the root is a leaf (that is, there are no
    other nodes in the tree), it can have as few as 1
    value and no children.

20
B+-Tree Node Structure
  • Node structure (leaf or internal):
    (P1, K1, P2, K2, . . ., Pn-1, Kn-1, Pn)
  • Ki are the search-key values.
  • Pi are pointers to children (for non-leaf nodes)
    or pointers to records or buckets of records (for
    leaf nodes).
  • The search-keys in a node are ordered:
  • K1 < K2 < K3 < . . . < Kn-1

21
Leaf Nodes in B+-Trees
Properties of a leaf node:
  • For i = 1, 2, . . ., n-1, pointer Pi either
    points to a file record with search-key value Ki,
    or to a bucket of pointers to file records, each
    record having search-key value Ki.
  • Only need the bucket structure if the search-key
    does not form a primary key.
  • If Li, Lj are leaf nodes and i < j, Li's
    search-key values are less than Lj's search-key
    values.
  • Pn points to the next leaf node in search-key
    order.

22
Non-Leaf Nodes in B+-Trees
  • Non-leaf nodes form a multi-level sparse index on
    the leaf nodes.
  • For a non-leaf node with n pointers:
  • All the search-keys in the subtree to which P1
    points are less than K1.
  • For 2 ≤ i ≤ n - 1, all the search-keys in the
    subtree to which Pi points have values greater
    than or equal to Ki-1 and less than Ki.
  • All the search-keys in the subtree to which Pn
    points are greater than or equal to Kn-1.

23
Example of B+-tree
B+-tree for account file (n = 5)
  • Leaf nodes must have between 2 and 4 values
    (⌈n/2⌉ - 1 and n - 1, with n = 5).
  • Non-leaf nodes other than root must have between
    3 and 5 children (⌈n/2⌉ and n, with n = 5).
  • Non-leaf nodes other than the root must have
    between 2 and 4 values.
  • Root must have at least 2 children.

24
Observations about B+-trees
  • A B+-tree contains a relatively small number of
    levels - logarithmic in the number of search keys
    appearing in the main file - thus searches can be
    conducted efficiently.
  • Insertions and deletions to the index file can be
    handled efficiently, as the index can be
    restructured in logarithmic time (as we shall
    see).
  • Each node is typically a disk block
  • logically close blocks need not be physically
    close.

25
Queries on B+-Trees
  • Searching a B+-tree for a given search-key value
    is a straightforward generalization of searching
    a binary search tree.
  • Find all records with a search-key value of k
    (see page 492):
  • Start with the root node.
  • Examine the node for the smallest search-key
    value > k.
  • If such a value exists, call it Kj, then
    follow Pj to the child node.
  • Otherwise k ≥ Km-1, where there are m pointers in
    the node. Then follow Pm to the child node.
  • If the node is not a leaf, repeat the above
    procedure on that node.
  • Eventually reach a leaf node.
  • If for some i, key Ki = k, follow pointer Pi to
    the desired record or bucket.
  • Otherwise no record with search-key value k
    exists.

26
Queries on B+-Trees (Cont.)
  • In processing a query, a path is traversed in the
    tree from the root to some leaf node.
  • If there are K search-key values in the file, the
    path is no longer than ⌈log⌈n/2⌉(K)⌉.
  • Since a node is generally the same size as a disk
    block, typically 4 kilobytes, n is typically
    around 100 (assuming 40 bytes per index entry).
  • With 1 million search-key values and n = 100, at
    most ⌈log50(1,000,000)⌉ = 4 nodes are accessed in
    a lookup.
  • Contrast this with a balanced binary tree with 1
    million search-key values - around 20 nodes,
    i.e., blocks, are accessed.
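The arithmetic on this slide can be checked directly, under the same assumptions (fan-out n = 100, so base ⌈n/2⌉ = 50, and 1 million search-key values):

```python
import math

K = 1_000_000
fanout = 100

# B+-tree: at most ceil(log_{n/2}(K)) node accesses per lookup
btree_path = math.ceil(math.log(K, fanout // 2))

# balanced binary tree: about ceil(log2(K)) node (block) accesses
binary_path = math.ceil(math.log2(K))

print(btree_path)   # 4
print(binary_path)  # 20
```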

27
Updates on B+-Trees: Insertion
  • Find the leaf node in which the search-key value
    would appear
  • If the search-key value is already in the leaf
    node
  • The record is added to data file, and
  • If necessary a pointer is inserted into the
    bucket.
  • If the search-key value is not in the leaf node
  • Add the record to the data file, and
  • Create a bucket if necessary.
  • If there is room in the leaf node, insert the
    (key-value, pointer) pair.
  • Otherwise, split the node as discussed in the
    next slide.

28
Updates on B+-Trees: Insertion (Cont.)
  • Splitting a node:
  • Examine the n (search-key value, pointer) pairs
    (including the one being inserted) in sorted
    order.
  • Place the first ⌈n/2⌉ in the original node, and
    the rest in a new node.
  • Let the new node be p, and let k be the least key
    value in p.
  • Insert (k, p) in the parent of the node being
    split (recursively).
  • If the parent is full, split it and propagate the
    split further up.
  • Splitting nodes continues up the tree until a
    node is found that is not full.
  • In the worst case the root node is split,
    increasing the height of the tree.
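The split step above can be sketched for a leaf node; the sample branch-name keys and the flat (key, pointer) representation are illustrative assumptions.

```python
import math

def split_leaf(pairs, new_pair):
    """Split an overfull leaf: first ceil(n/2) pairs stay, rest move."""
    pairs = sorted(pairs + [new_pair])  # the n pairs in sorted order
    half = math.ceil(len(pairs) / 2)
    original, new_node = pairs[:half], pairs[half:]
    # the least key of the new node is inserted into the parent
    return original, new_node, new_node[0][0]

full_leaf = [("Brighton", 0), ("Downtown", 1), ("Mianus", 2)]
left, right, up_key = split_leaf(full_leaf, ("Clearview", 3))
print([k for k, _ in left])   # ['Brighton', 'Clearview']
print([k for k, _ in right])  # ['Downtown', 'Mianus']
print(up_key)                 # 'Downtown' propagates to the parent
```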

29
Updates on B+-Trees: Insertion (Cont.)
B+-Tree before and after insertion of "Clearview"
30
Updates on B+-Trees: Deletion
  • Find the record to be deleted, and remove it from
    the data file and from the bucket (if present).
  • Remove (search-key value, pointer) from the leaf
    node if there is no bucket or if the bucket has
    become empty.
  • If the node has too few entries, and the entries
    in the node and a sibling fit into a single node,
    then:
  • Insert all search-key values in the two nodes
    into a single node (the one on the left), and
    delete the other node.
  • Delete the pair (Ki-1, Pi), where Pi is the
    pointer to the deleted node, from its parent,
    recursively using the above procedure.

31
Updates on B+-Trees: Deletion (Cont.)
  • Otherwise, if the node has too few entries, but
    the entries in the node and a sibling do not fit
    into a single node, then:
  • Redistribute the pointers between the node and a
    sibling so that both have more than the minimum
    number of entries.
  • Update the corresponding search-key value in the
    parent node.
  • Node deletions may cascade upwards until a node
    with ⌈n/2⌉ or more pointers is found, or
    redistribution with a sibling occurs.
  • If the root node has only one pointer after
    deletion, it is deleted and the sole child
    becomes the root.

32
Examples of B+-Tree Deletion
  • Before and after deleting "Downtown"
  • Note that the removal of the leaf node containing
    "Downtown" did not result in its parent having
    too few pointers, so the deletions stopped.

33
Examples of B+-Tree Deletion ("Perryridge")
Should be "Perryridge", based on the code.
  • Node with "Perryridge" becomes underfull
    (actually empty, in this special case) and is
    merged with its sibling.
  • As a result the "Perryridge" node's parent became
    under-full, and was merged with its sibling (and
    an entry was deleted from their parent).
  • Root node then had only one child, and was
    deleted.

34
Example of B+-tree Deletion ("Perryridge")
  • Parent of leaf containing "Perryridge" became
    under-full, and borrowed a pointer from its left
    sibling.
  • Search-key value in the parent's parent changes
    as a result.

35
B+-Tree File Organization
  • The index file degradation problem is solved by
    using B+-tree indices.
  • Data file degradation is eliminated by using a
    B+-tree file organization.
  • The leaf nodes in a B+-tree file organization
    store the actual records.
  • Since records are larger than pointers, the
    maximum number of records that can be stored in a
    leaf node is less than the number of pointers in
    a non-leaf node.
  • Leaf nodes are still required to be half full.
  • Insertion and deletion are handled in the same
    way as insertion and deletion of entries in a
    B+-tree index.

36
B+-Tree File Organization (Cont.)
Example of B+-tree File Organization
  • Good space management is needed since records use
    more space than pointers.
  • To improve space utilization, involve more
    siblings in merge and split redistribution.
  • Involving 2 siblings in redistribution (to avoid
    split / merge where possible) results in each
    node having at least ⌊2n/3⌋ entries.

37
B-Tree Index Files
  • A B-tree is similar to a B+-tree, but it allows
    search-key values to appear only once.
  • Search keys in non-leaf nodes appear nowhere else
    in the B-tree; an additional pointer field for
    each search key in a non-leaf node is included.
  • Generalized B-tree leaf node
  • Non-leaf node - pointers Bi are the bucket or
    file record pointers.

38
B-Tree Index File Example
  • B-tree and corresponding B+-tree on the same data

39
B-Tree Index Files, Cont.
  • Advantages of B-tree indices:
  • May use fewer tree nodes than a corresponding
    B+-tree.
  • Sometimes possible to find a search-key value
    before reaching a leaf node.
  • Disadvantages of B-tree indices:
  • Only a small fraction of all search-key values
    are actually found early.
  • Non-leaf nodes contain more data, so fan-out is
    reduced; thus, B-trees typically have greater
    depth than the corresponding B+-tree.
  • Insertion and deletion are more complicated than
    in B+-trees.
  • Implementation is harder than for B+-trees.
  • Typically, the advantages of B-trees do not
    outweigh the disadvantages.

40
Static Hashing
  • A bucket is a unit of storage containing one or
    more records.
  • In the simplest and ideal case a bucket is a disk
    block.
  • Every bucket has an address.
  • A hash function h is a function from the set of
    all search-key values K to the set of all bucket
    addresses B.
  • Typically quick and easy to compute.
  • In a hash file organization we obtain the bucket
    of a record directly from its search-key value
    using a hash function.
  • Used to locate, insert and delete records.
  • Records with different search-key values are
    typically mapped to the same bucket.
  • Thus an entire bucket has to be searched
    sequentially to locate a record.

41
Example of Hash File Organization
  • Consider a hash file organization of the account
    table (next page)
  • 10 buckets (more typically, this is a prime
    number)
  • Search key is branch-name
  • Typically a hash function performs a computation
    on the internal binary representation of the
    search-key.
  • Example:
  • Let the binary representation of the ith
    character of the alphabet be the integer i.
  • Return the sum of the binary representations of
    the characters modulo 10.
  • h(Mianus) = (13+9+1+14+21+19) mod 10
    = 77 mod 10 = 7
  • h(Perryridge) = 5
  • h(Round Hill) = 3
  • h(Brighton) = 3
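This example hash function can be sketched directly; the assumption that non-alphabetic characters (such as the space in "Round Hill") contribute nothing is mine, but it reproduces all four values on the slide.

```python
def h(branch_name, buckets=10):
    """Sum of alphabet positions (a/A = 1, ..., z/Z = 26) mod buckets."""
    total = sum(ord(c) - ord('a') + 1
                for c in branch_name.lower() if c.isalpha())
    return total % buckets

assert h("Mianus") == 7       # (13+9+1+14+21+19) mod 10 = 77 mod 10
assert h("Perryridge") == 5
assert h("Round Hill") == 3
assert h("Brighton") == 3     # Round Hill and Brighton collide
print(h("Mianus"))  # 7
```

Note that two different branch names hash to bucket 3, which is exactly the kind of collision the bucket-scan on the previous slide handles.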

42
Example of Hash File Organization
43
Hash Functions
  • Ideal hash function is uniform - each bucket is
    assigned the same number of search-key values
    from the set of all possible search keys.
  • Ideal hash function is random - on average each
    bucket will have the same number of records
    assigned to it irrespective of the actual
    distribution of search-key values in the file.
  • Only possible if each search key value is in a
    small fraction of records.
  • See the book for an example hash function that is
    uniform but not random.
  • Worst case the hash function maps all
    search-key values to the same bucket this makes
    access time proportional to the number of
    search-key values in the file.
  • Extensive research on hash functions has been
    done over the years.
  • Other textbooks cover hash functions to a much
    greater degree.

44
Bucket Overflows
  • Goal: one block per bucket.
  • Bucket overflow occurs when a bucket has been
    assigned more records than it has space for.
  • Bucket overflow can occur because of
  • Insufficient buckets - lots of records
  • Skew in record distribution across buckets
  • multiple records have same search-key value
  • bad hash function (non-random distribution)
  • non-prime number of buckets
  • The probability of bucket overflow can be
    reduced, but not eliminated.
  • Overflow is handled by using overflow buckets.

45
Handling of Bucket Overflows
  • Overflow chaining - the overflow blocks of a
    given bucket are chained together in a linked
    list.
  • This is called closed hashing.
  • An alternative, called open hashing, which does
    not use overflow buckets, is not suitable for
    database applications.
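Overflow chaining can be sketched as follows; the 4 primary buckets, capacity of 2 records per block, and integer keys are illustrative assumptions (Python's built-in `hash` stands in for the hash function).

```python
B, CAP = 4, 2  # number of buckets, records per block (illustrative)

# each bucket is a chain of blocks; the first block is the primary one
buckets = [[] for _ in range(B)]

def insert(key):
    chain = buckets[hash(key) % B]
    # walk the chain looking for a block with free space
    for block in chain:
        if len(block) < CAP:
            block.append(key)
            return
    chain.append([key])  # all blocks full: add an overflow block

def lookup(key):
    # the entire chain must be searched sequentially
    return any(key in block for block in buckets[hash(key) % B])

for k in (1, 5, 9, 13):   # all hash to bucket 1 -> overflow
    insert(k)
print(lookup(9))   # True, found in the overflow block
print(lookup(2))   # False
```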

46
Hash Indices
  • Hashing can be used not only for file
    organization, but also for index-structure
    creation.
  • A hash index organizes the search keys, with
    their associated record pointers, into a hash
    file structure.
  • The term hash index is typically used to refer to
    both a separate hash-organized index structure
    and hash organized files.

47
Example of a Hash Index
Data blocks probably sorted on some attribute
other than account number
  • Index blocks

48
Hash Indices, Cont.
  • The book says that strictly speaking, hash
    indices are always secondary indices.
  • This raises the following question
  • Would a hash index make sense on an attribute on
    which the table is sorted?
  • If so, then, by definition, that index would be a
    primary index, thereby contradicting the above.
  • Why the heck not?
  • Perhaps the point the authors were trying to make
    is:
  • If data records are assigned to blocks using
    hashing, then the table is not sorted by the
    search key, so it can't be referred to as a
    primary index.

49
Deficiencies of Static Hashing
  • In static hashing h maps search-key values to a
    fixed set of buckets.
  • If initial number of buckets is too small,
    performance will degrade due to overflows as the
    database grows.
  • If file size in the future is anticipated and the
    number of buckets allocated accordingly,
    significant amount of space will be wasted
    initially.
  • If the database shrinks, again space will be
    wasted.
  • One option is periodic re-organization of the
    file with a new hash function, but this can be
    expensive.
  • These problems can be avoided by using techniques
    that allow the number of buckets to be modified
    dynamically.

50
Dynamic Hashing
  • Good for databases that grow and shrink in size.
  • The hash function and number of buckets change
    dynamically.
  • Extendable hashing is one form of dynamic
    hashing.
  • An extendable hash table has three parts
  • Hash function
  • Buckets
  • Bucket address table

51
Dynamic Hashing Hash Function
  • Hash function:
  • Generates values over a large range - typically
    b-bit integers, with b = 32.
  • At any time only a prefix of the hash function is
    used.
  • The length of the prefix used is i bits,
    0 ≤ i ≤ b.
  • i is referred to as the global depth.
  • The value of i will grow and shrink along with
    the hash table.
  • i is used to index into the bucket address table.

52
Dynamic Hashing: Bucket Address Table
  • Bucket address table contains 2^i entries;
    initially i = 0.
  • Each entry contains a pointer to a bucket.
  • Multiple entries in the bucket address table may
    point to the same bucket.
  • Thus, the actual number of buckets is ≤ 2^i.

53
Dynamic Hashing: Buckets
  • As insertions and deletions take place buckets
    are merged and split.
  • Each bucket has an associated value (address)
    called the local depth.
  • The local depth is equal to the number of bits
    that distinguish values in the bucket from the
    values in all other buckets.
  • Bucket chaining may still occur.

54
General Extendable Hash Structure
Global depth
Local depth
  • Note that it is always the case that i_j ≤ i.

55
Use of Extendable Hash Structure
  • Throughout the following:
  • i is the global depth.
  • i_j is the local depth for bucket j.
  • To locate the bucket containing search-key Kj:
  • 1. Compute h(Kj) = X.
  • 2. Use the first i bits of X as a displacement
    into the bucket address table, and follow the
    pointer to the appropriate bucket.
  • To insert a record with search-key value Kj:
  • Follow the above procedure to locate the bucket,
    say j.
  • If there is room in the bucket then insert the
    record.
  • Otherwise split the bucket and re-attempt
    insertion (next slide).

56
Updates in Extendable Hash Structure
To split a bucket j when inserting a record with
search-key value Kj:
  • If i > i_j (more than one pointer to bucket j):
  • Allocate a new bucket z, and set i_j and i_z to
    i_j + 1.
  • Make the second half of the bucket address table
    entries currently pointing to j point to z.
  • Remove and reinsert each record in bucket j.
  • Re-compute the new bucket for Kj and insert the
    record in the bucket (further splitting is
    required if the bucket is still full).
  • If i = i_j (only one pointer to bucket j):
  • Increment i and double the size of the bucket
    address table.
  • Replace each entry in the table by two entries
    pointing to the same bucket.
  • Re-compute the new bucket address table entry for
    Kj. Now i > i_j, so use the first case above.
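The two split cases above can be sketched compactly. This is a minimal sketch, not a full implementation: bucket capacity 2, an 8-bit hash, and the integer keys are illustrative assumptions.

```python
CAP, BITS = 2, 8  # bucket capacity and hash width (illustrative)

class Bucket:
    def __init__(self, depth):
        self.depth = depth  # local depth i_j
        self.keys = []

def h(key):  # hash value; only the first i bits are used
    return key % (1 << BITS)

class ExtendableHash:
    def __init__(self):
        self.i = 0                 # global depth
        self.table = [Bucket(0)]   # bucket address table, 2**i entries

    def _index(self, key):         # first i bits of h(key)
        return h(key) >> (BITS - self.i) if self.i else 0

    def insert(self, key):
        b = self.table[self._index(key)]
        if len(b.keys) < CAP:
            b.keys.append(key)
            return
        if b.depth == self.i:      # only one pointer: double the table
            self.i += 1
            self.table = [p for p in self.table for _ in (0, 1)]
        # split b: allocate z with depth i_j + 1, repoint second half
        b.depth += 1
        z = Bucket(b.depth)
        old, b.keys = b.keys, []
        ptrs = [j for j, p in enumerate(self.table) if p is b]
        for j in ptrs[len(ptrs) // 2:]:
            self.table[j] = z
        for k in old:              # remove and reinsert each record
            self.table[self._index(k)].keys.append(k)
        self.insert(key)           # re-attempt (may split again)

eh = ExtendableHash()
for k in (128, 64, 192):           # keys chosen to differ in the top bit
    eh.insert(k)
print(eh.i)                        # global depth grew to 1
print(eh.table[0].keys, eh.table[1].keys)  # [64] [128, 192]
```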

57
Updates in Extend. Hash Structure (Cont.)
  • When inserting, if a bucket is still full after
    several splits create an overflow bucket instead
    of splitting the bucket further.
  • To delete a record
  • Locate the record and remove it from its bucket.
  • The bucket itself can be removed if it becomes
    empty (with appropriate updates to the bucket
    address table).
  • Buckets can be coalesced.
  • Decreasing bucket address table size is also
    possible.
  • Note:
  • Can only coalesce a bucket with a "buddy" bucket
    having the same local depth (value of i_j) and
    the same i_j - 1 bit prefix, if it is present.
  • Decreasing bucket address table size might be an
    expensive operation and should only be done if
    the number of buckets becomes much smaller than
    the size of the table.

58
Use of Extend. Hash Structure (Example)
Initial hash structure, bucket size = 2
59
Example (Cont.)
  • Hash structure after insertion of one Brighton
    and two Downtown records

60
Example (Cont.)
Hash structure after insertion of Mianus record
61
Example (Cont.)
Hash structure after insertion of three
Perryridge records
62
Example (Cont.)
  • Hash structure after insertion of Redwood and
    Round Hill records

63
Extendable Hashing vs. Other Schemes
  • Benefits of extendable hashing
  • Hash performance does not degrade with file
    growth.
  • Minimal space overhead.
  • Disadvantages of extendable hashing
  • Extra level of indirection to find desired
    record.
  • Bucket address table may itself become larger
    than memory.
  • Could impose a tree (or other) structure to
    locate an index entry.
  • Changing the size of bucket address table can be
    an expensive operation.
  • Linear hashing is an alternative mechanism which
    avoids these disadvantages at the possible cost
    of more bucket overflows.

64
Comparison of Ordered Indexing and Hashing
  • What are the expected type of queries?
  • Hashing is generally better for point queries.
  • If range queries are common, ordered indices are
    preferred.
  • Is it desirable to optimize average access time
    at the expense of worst-case access time?
  • What is the cost and frequency of insertions and
    deletions?
  • What is the cost and frequency of periodic
    re-organization?

65
Index Definition in SQL
  • Creating an index:
  • create index <index-name> on <relation-name>
    (<attribute-list>)
  • Example:
  • create index b-index on branch(branch-name)
  • To drop an index:
  • drop index <index-name>
  • Indices can also be used to enforce the condition
    that a search key is a candidate key:
  • create unique index
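The statements above can be run against SQLite; underscores replace the hyphens in the slide's names, since unquoted hyphenated identifiers are not legal SQL, and the second column is an illustrative addition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table branch (branch_name text, branch_city text)")

# create index <index-name> on <relation-name> (<attribute-list>)
conn.execute("create index b_index on branch (branch_name)")

# create unique index: enforces that the search key is a candidate key
conn.execute("create unique index b_unique on branch (branch_name)")

conn.execute("insert into branch values ('Perryridge', 'Horseneck')")
try:
    conn.execute("insert into branch values ('Perryridge', 'Brooklyn')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)  # the unique index rejects the duplicate key

# drop index <index-name>
conn.execute("drop index b_index")
```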

66
Multiple-Key Access
  • Multiple indices can be used for certain queries:
  • select account-number
  • from account
  • where branch-name = "Perryridge" and balance =
    1000
  • Possible strategies using single-attribute
    indices:
  • 1. Use index on branch-name to find accounts with
    branch-name = "Perryridge"; test balance = 1000
    in memory.
  • 2. Use index on balance to find accounts with
    balances of 1000; test branch-name =
    "Perryridge" in memory.
  • 3. Use three steps:
  • Use branch-name index to find pointers to all
    records pertaining to the Perryridge branch.
  • Similarly use index on balance.
  • Take the intersection of both sets of pointers
    obtained.
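The three strategies can be sketched with dict-based secondary indices; the account rows are made-up sample data, and row ids play the role of record pointers.

```python
account = [  # (row id, account-number, branch-name, balance)
    (0, "A-101", "Perryridge", 1000),
    (1, "A-215", "Mianus",     700),
    (2, "A-102", "Perryridge", 400),
    (3, "A-201", "Perryridge", 1000),
]

# single-attribute secondary indices: value -> set of row ids
by_branch, by_balance = {}, {}
for rid, _, branch, bal in account:
    by_branch.setdefault(branch, set()).add(rid)
    by_balance.setdefault(bal, set()).add(rid)

# Strategy 1: branch-name index first, test balance in memory
s1 = {r for r in by_branch["Perryridge"] if account[r][3] == 1000}
# Strategy 2: balance index first, test branch-name in memory
s2 = {r for r in by_balance[1000] if account[r][2] == "Perryridge"}
# Strategy 3: intersect the two pointer sets
s3 = by_branch["Perryridge"] & by_balance[1000]

print(sorted(s1), sorted(s2), sorted(s3))  # all three give [0, 3]
```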

67
Indices on Multiple Attributes
  • An index can be created on more than one
    attribute.
  • A key requirement for ordered indices is the
    ability to compare search keys for =, < and >.
  • Suppose a multi-attribute index is created on the
    attributes (branch-name, balance) in the account
    relation.
  • How do comparisons work?
  • =, <, >, etc.

68
Indices on Multiple Attributes, Cont.
  • A multi-attribute index can support queries such
    as:
  • where branch-name = "Perryridge" and balance =
    1000
  • Using separate indices is likely to be less
    efficient - many records (or pointers) could be
    retrieved that satisfy only one of the
    conditions.
  • The multi-attribute index is helpful for queries
    such as:
  • where branch-name = "Perryridge" and balance <
    1000
  • The multi-attribute index is not helpful for
    queries such as:
  • where branch-name < "Perryridge" and balance =
    1000
  • May fetch many records that satisfy the first
    but not the second condition.

69
Bitmap Indices
  • Bitmap indices are a special type of index
    designed for efficient querying on multiple
    search keys.
  • Not particularly useful for single attribute
    queries
  • Typically used in data warehouses
  • Applicable on attributes having a small number of
    distinct values, e.g.:
  • Gender, country, state, ...
  • Income-level (0-9999, 10000-19999, 20000-50000,
    50000-infinity)
  • Bitmap assumptions
  • Records in a relation are numbered sequentially
    from, say, 0
  • Given a number n it must be easy to retrieve
    record n
  • Particularly easy if records are of fixed size

70
Bitmap Indices (Cont.)
  • A bitmap index on an attribute has a bitmap for
    each attribute value.
  • A bitmap is simply an array of bits
  • Bitmap has as many bits as there are records in
    the file.
  • In a bitmap for value v, the bit for a record is
    1 if the record has the value v for the
    attribute, and is 0 otherwise

71
Bitmap Indices (Cont.)
  • Queries on multiple bitmap-indexed attributes use
    bitmap operations
  • Intersection (and): 100110 AND 110011 = 100010
  • Union (or): 100110 OR 110011 = 110111
  • Complementation (not): NOT 100110 = 011001
  • Example - retrieve records for males with income
    level L1:
  • 10010 AND 10100 = 10000
  • Resulting bitmap is then used to retrieve
    satisfying tuples
  • Counting the number of satisfying tuples is even
    faster
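These operations map directly onto integer bitwise operators; the helper names and the most-significant-bit-first rendering are illustrative choices.

```python
def bits(s):
    """Parse a bitmap literal such as "100110" into an int."""
    return int(s, 2)

def show(x, w):
    """Render a bitmap as a w-bit string, record 0 leftmost."""
    return format(x, f"0{w}b")

# the slide's three operations
assert show(bits("100110") & bits("110011"), 6) == "100010"   # AND
assert show(bits("100110") | bits("110011"), 6) == "110111"   # OR
assert show(~bits("100110") & 0b111111, 6) == "011001"        # NOT, masked

# males with income level L1
male, level_L1 = bits("10010"), bits("10100")
print(show(male & level_L1, 5))  # 10000 -> only record 0 qualifies
```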

72
Bitmap Indices (Cont.)
  • How big is a bitmap index?
  • Bitmap indices are generally very small compared
    with relation size.
  • Example:
  • number of records is n
  • record size = 100 bytes, i.e., 800 bits (somewhat
    conservative)
  • total table size is n x 800 bits
  • bitmap size is n bits
  • space for a single bitmap is 1/800 of the space
    used by the relation.
  • If the number of distinct attribute values is 8,
    then the bitmap index is only 1% of the relation
    size.
  • Exercise (do the math) - What happens if each
    record is 32 bits and there are 64 distinct
    attribute values? Where is the cut-off point?
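The example's arithmetic can be checked directly (the cut-off exercise is deliberately left open):

```python
# one bitmap per distinct value, n bits each, against a relation of
# n records at 100 bytes (800 bits) per record
record_bits = 100 * 8
one_bitmap_fraction = 1 / record_bits      # 1/800 of the relation

distinct_values = 8
index_fraction = distinct_values / record_bits

print(f"{index_fraction:.0%}")  # 1%
```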

73
Bitmap Indices (Cont.)
  • How does a bitmap change if a record is deleted?
  • Could we simply put a 0 in all bitmaps for the
    record?
  • No! Consider:
  • not(A = v)
  • An existence bitmap could be used to indicate if
    there is a valid record at a specific location,
    for negation to work properly:
  • (NOT bitmap-A-v) AND ExistenceBitmap
  • Bitmaps need to be maintained for all values,
    including null.
  • To correctly handle SQL (3-valued) null semantics
    for NOT(A = v):
  • Intersect the above result with (NOT
    bitmap-A-Null).

74
Efficient Implementation of Bitmap Operations
  • Bitmaps are packed into words.
  • A single word AND operation (a basic CPU
    instruction) computes the AND of 32 or 64 bits at
    once.
  • The conjunction (and) of two bitmaps containing 1
    million bits can be done with just 31,250
    instructions.
  • Counting the number of 1s can be done fast by a
    trick
  • Use each byte to index into a pre-computed array
    of 256 elements each storing the count of 1s in
    the binary representation.
  • Add up retrieved counts
  • Bitmaps can be used instead of Tuple-ID lists at
    the leaf levels of B+-trees, for values that have
    a large number of matching records.
  • If a tuple-id is 64 bits, then this saves space
    if > 1/64 of the records have a specific value
    (as an exercise, do the math).
  • This technique merges the benefits of bitmap and
    B+-tree indices.
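The counting trick above can be sketched directly; representing the bitmap as Python `bytes` is an illustrative choice.

```python
# pre-computed array of 256 entries: the count of 1s in each byte value
table = [bin(b).count("1") for b in range(256)]

def count_ones(bitmap: bytes) -> int:
    """Count set bits by adding up one table lookup per byte."""
    return sum(table[b] for b in bitmap)

bm = bytes([0b10010110, 0b11110000, 0b00000001])
print(count_ones(bm))  # 4 + 4 + 1 = 9
```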

75
End of Chapter
76
Partitioned Hashing
  • Hash values are split into segments that depend
    on each attribute of the search-key.
  • (A1, A2, . . . , An) for n attribute search-key
  • Example: n = 2, for customer, search-key being
    (customer-street, customer-city)
  • search-key value -> hash value
  • (Main, Harrison)   -> 101 111
  • (Main, Brooklyn)   -> 101 001
  • (Park, Palo Alto)  -> 010 010
  • (Spring, Brooklyn) -> 001 001
  • (Alma, Palo Alto)  -> 110 010
  • To answer equality query on single attribute,
    need to look up multiple buckets. Similar in
    effect to grid files.

77
Grid Files
  • A grid file is an index that supports multiple
    search-key queries involving one or more
    comparison operators.
  • The grid file consists of
  • a single n-dimensional grid array, where n is the
    number of search key attributes.
  • n linear scales, one for each search-key
    attribute.
  • Multiple cells of grid array can point to same
    bucket.

78
Example Grid File for account
79
Queries on a Grid File
  • A grid file on two attributes A and B can handle
    queries of all following forms with reasonable
    efficiency
  • (a1 ≤ A ≤ a2)
  • (b1 ≤ B ≤ b2)
  • (a1 ≤ A ≤ a2 and b1 ≤ B ≤ b2)
  • During insertion (Similar to extendable hashing,
    but on n dimensions)
  • If a bucket becomes full then it can be split if
    more than one cell points to it.
  • If only one cell points to the bucket, either an
    overflow bucket can be created or the grid size
    can be increased
  • During deletion (Also similar to extendable
    hashing)
  • If a bucket becomes empty, the bucket can be
    merged with other buckets
  • Or the grid pointer can be set to null, etc.

80
Grid Files (Cont.)
  • Linear scales must be chosen to uniformly
    distribute records across cells.
  • Otherwise there will be too many overflow
    buckets.
  • Periodic re-organization to increase grid size
    will help.
  • But reorganization can be very expensive.
  • Space overhead of grid array can be high.
  • R-trees (Chapter 23) are an alternative