Hash-Based Indexes - PowerPoint PPT Presentation

About This Presentation

Slides: 22

Provided by: RaghuRamak241

Learn more at: http://web.cs.wpi.edu

Transcript and Presenter's Notes

1
Hash-Based Indexes

Chapter 11

2
Introduction: Hash-based Indexes

Best for equality selections.
Cannot support range searches.
Static and dynamic hashing techniques exist.
Trade-offs similar to ISAM vs. B+ trees.

3
Static Hashing

h(k) mod N = bucket to which the data entry with key k belongs (N = # of buckets).

[Figure: key -> h -> h(key) mod N selects one of the primary bucket pages 0 to N-1; each primary page may have overflow pages chained to it.]

4
Static Hashing: h(k) mod N

# of primary pages is fixed (N = # of buckets)
allocated sequentially
never de-allocated
overflow pages if needed.

[Figure: key -> h -> h(key) mod N selects one of the primary bucket pages 0 to N-1; each primary page may have overflow pages chained to it.]

5
Static Hashing

h(k) mod N = bucket to which the data entry with key k belongs, with N = # of buckets.
Hash function works on the search key of record r.
h() must distribute values over the range 0 ... N-1.
For example, h(key) = (a * key + b)
a and b are constants
a lot is known about how to tune h.
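
The bucket computation on this slide can be sketched in a few lines of Python. This is only an illustration: the constants a = 31 and b = 7, the bucket count N = 8, and the helper names make_hash and bucket_of are assumptions, not values or names from the slides.

```python
def make_hash(a: int, b: int):
    """Return h(key) = a * key + b for fixed constants a and b."""
    def h(key: int) -> int:
        return a * key + b
    return h


def bucket_of(key: int, h, n_buckets: int) -> int:
    """Static hashing: the entry with this key lives in bucket h(key) mod N."""
    return h(key) % n_buckets


h = make_hash(a=31, b=7)   # arbitrary illustrative constants (assumption)
N = 8                      # assumed number of primary bucket pages
buckets = [bucket_of(k, h, N) for k in range(16)]
```

With these particular constants the 16 consecutive keys spread evenly over all 8 buckets; a poorly chosen a (for example one that shares a factor with N) would leave some buckets unused.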

6
Static Hashing: Cons

Primary pages use fixed space -> static structure.
The fixed number of buckets is the problem.
Rehashing can be done -> not good for search.
In practice, overflow chains are used instead.
Long overflow chains can develop and degrade performance.
Solution: employ dynamic techniques to fix this problem:
Extendible Hashing, or
Linear Hashing
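
To see why long overflow chains hurt, here is a minimal Python sketch of a static hash file. It assumes h(key) = key, a page capacity of 2 entries, and class names (Page, StaticHashFile) invented for this illustration; the search method counts one page read per page visited in a chain.

```python
class Page:
    """A fixed-capacity page of data entries with a link to an overflow page."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = []
        self.overflow = None  # next page in this bucket's overflow chain


class StaticHashFile:
    def __init__(self, n_buckets: int, capacity: int):
        self.n = n_buckets
        self.capacity = capacity
        # Primary pages are allocated once and never de-allocated.
        self.primary = [Page(capacity) for _ in range(n_buckets)]

    def insert(self, key: int) -> None:
        page = self.primary[key % self.n]   # h(key) = key is an assumption
        while len(page.entries) >= page.capacity:
            if page.overflow is None:
                page.overflow = Page(self.capacity)  # grow the chain
            page = page.overflow
        page.entries.append(key)

    def search(self, key: int):
        """Return (found, pages_read); each page in the chain costs one read."""
        page, reads = self.primary[key % self.n], 0
        while page is not None:
            reads += 1
            if key in page.entries:
                return True, reads
            page = page.overflow
        return False, reads


f = StaticHashFile(n_buckets=4, capacity=2)
for k in range(0, 40, 4):      # ten keys that all hash to bucket 0
    f.insert(k)
found, reads = f.search(36)    # 36 sits in the last page of the chain
```

In this skewed run one bucket holds all ten entries across five pages, so finding the last key touches five pages while the other three primary pages stay empty.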

7
Extendible Hashing

Problem: Bucket (primary page) becomes full.
Solution: Why not re-organize the file by doubling the # of buckets?
Reading and writing all pages is expensive!
Idea: Use a directory of pointers to buckets instead of the buckets themselves
double the # of buckets by doubling the directory
split just the bucket that overflowed!
Discussion
Directory is much smaller than the file, so doubling it is cheaper. Only one page of data entries is split.
No overflow pages ever.
Trick lies in how the hash function is adjusted!

8
Example
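
Directory is an array of size 4.
To find the bucket for r, take the last "global depth" # of bits of h(r).

[Figure: directory with global depth 2 pointing to the data pages]
00 -> Bucket A (local depth 2): 4, 12, 32, 16
01 -> Bucket B (local depth 2): 1, 5, 21, 13
10 -> Bucket C (local depth 2): 10
11 -> Bucket D (local depth 2): 15, 7, 19
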
9
Example

If h(r) = 5 = binary 101, it is in the bucket pointed to by 01.
If h(r) = 4 = binary 100, it is in the bucket pointed to by 00.

[Figure: directory with global depth 2; 00 -> Bucket A, 01 -> Bucket B, 10 -> Bucket C, 11 -> Bucket D, each with local depth 2.]

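
The lookup rule in this example (take the last "global depth" bits of h(r)) can be sketched in Python as a bit mask. The function name directory_index is an assumption made for this illustration, and buckets are represented only by their labels A to D, in the directory order 00, 01, 10, 11 of the example.

```python
def directory_index(hash_value: int, global_depth: int) -> int:
    """Keep only the last `global_depth` bits of the hash value."""
    return hash_value & ((1 << global_depth) - 1)


# Directory from the example: 0b00 -> A, 0b01 -> B, 0b10 -> C, 0b11 -> D.
directory = ["A", "B", "C", "D"]
GLOBAL_DEPTH = 2

bucket_for_5 = directory[directory_index(5, GLOBAL_DEPTH)]   # 5 = 0b101
bucket_for_4 = directory[directory_index(4, GLOBAL_DEPTH)]   # 4 = 0b100
```
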
10
Insertion

Insert: If the bucket is full, split it (allocate a new page, re-distribute the contents).
Splitting may
double the directory, or
simply link in a new page.
To tell what to do:
Compare global depth with local depth for the split bucket.

[Figure: directory with global depth 2; 00 -> Bucket A (4, 12, 32, 16), 01 -> Bucket B (1, 5, 21, 13), 10 -> Bucket C (10), 11 -> Bucket D (15, 7, 19), each with local depth 2.]


11
Insert h(r) = 6 (binary 110)
12
Insert h(r) = 6 (binary 110)
13
Insert h(r) = 20 (binary 10100)
14
Insert h(r) = 20

Split Bucket A into two buckets A1 and A2.

[Figure: the two buckets after the split]
Bucket A1 (local depth 3): 32, 16
Bucket A2 (local depth 3): 4, 12, 20

15
Insert h(r) = 20

[Figure: directory before the insert, global depth 2; 00 -> Bucket A (4, 12, 32, 16), 01 -> Bucket B (1, 5, 21, 13), 10 -> Bucket C (10), 11 -> Bucket D (15, 7, 19), each with local depth 2.]

16
Insert h(r) = 20

[Figure: the directory before and after doubling. Before: global depth 2, directory entries 00 to 11. After: global depth 3, directory entries 000 to 111, as listed below.]
000 -> Bucket A (local depth 3): 32, 16
100 -> Bucket A2 (local depth 3, "split image" of Bucket A): 4, 12, 20
001, 101 -> Bucket B (local depth 2): 1, 5, 21, 13
010, 110 -> Bucket C (local depth 2): 10
011, 111 -> Bucket D (local depth 2): 15, 7, 19

17
Points to Note

20 = binary 10100.
Last 2 bits (00) tell us whether r belongs in A (or not A).
Last 3 bits are needed to tell A1 from A2.
Using bits:
Global depth of directory = max # of bits needed to tell which bucket an entry belongs to.
Local depth of a bucket = # of bits used to determine if an entry belongs to this bucket.
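
The bit counting on this slide can be checked with a short Python sketch. The helper name last_bits is an assumption; the values are the four entries of Bucket A from the example plus the new entry 20.

```python
def last_bits(value: int, n_bits: int) -> str:
    """Return the last n_bits of value as a zero-padded binary string."""
    return format(value & ((1 << n_bits) - 1), f"0{n_bits}b")


old_bucket_a = [32, 16, 4, 12]
new_entry = 20

# With 2 bits every value looks the same; 3 bits separate A1 from A2.
two_bits = {k: last_bits(k, 2) for k in old_bucket_a + [new_entry]}
three_bits = {k: last_bits(k, 3) for k in old_bucket_a + [new_entry]}

a1 = [k for k, bits in three_bits.items() if bits == "000"]
a2 = [k for k, bits in three_bits.items() if bits == "100"]
```

All five values end in 00, so two bits cannot separate them; the third bit sends 32 and 16 to A1 and 4, 12, and 20 to A2.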

18
More Points to Note

Bits:
Global depth of directory = max # of bits needed to tell which bucket an entry belongs to.
Local depth of a bucket = # of bits used to determine if an entry belongs to this bucket.
When does a bucket split cause directory doubling?
Before the insert, local depth of the bucket = global depth.
The insert causes local depth to become > global depth.
Directory is doubled by copying it over and fixing the pointer to the split image page.
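
The insert and doubling rules can be put together in a compact Python sketch. It is one possible in-memory rendering under stated assumptions, not the slides' implementation: the class names Bucket and ExtendibleHash are invented here, hash values are inserted directly in place of records, the bucket capacity of 4 matches the example figures, and hash values are assumed to be distinct (repeated values beyond the capacity would split forever, since there are no overflow pages).

```python
class Bucket:
    def __init__(self, local_depth: int, capacity: int):
        self.local_depth = local_depth
        self.capacity = capacity
        self.entries = []


class ExtendibleHash:
    def __init__(self, capacity: int = 4):
        self.global_depth = 0
        self.capacity = capacity
        self.directory = [Bucket(0, capacity)]

    def _index(self, h: int) -> int:
        # Last `global_depth` bits of the hash value.
        return h & ((1 << self.global_depth) - 1)

    def insert(self, h: int) -> None:
        bucket = self.directory[self._index(h)]
        if len(bucket.entries) < bucket.capacity:
            bucket.entries.append(h)
            return
        if bucket.local_depth == self.global_depth:
            # Local depth would exceed global depth: double the
            # directory by copying it over.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        # Split only the full bucket, using one more bit.
        bucket.local_depth += 1
        image = Bucket(bucket.local_depth, self.capacity)
        high_bit = 1 << (bucket.local_depth - 1)
        old = bucket.entries
        bucket.entries = [e for e in old if not e & high_bit]
        image.entries = [e for e in old if e & high_bit]
        # Fix the pointers that should now reach the split image.
        for i, b in enumerate(self.directory):
            if b is bucket and i & high_bit:
                self.directory[i] = image
        self.insert(h)   # retry; may split again (assumes distinct values)


index = ExtendibleHash(capacity=4)
for value in [4, 12, 32, 16, 1, 5, 21, 13, 10, 15, 7, 19]:
    index.insert(value)
depth_before = index.global_depth   # state of the example figure
index.insert(20)                    # Bucket A is full and at global depth
depth_after = index.global_depth
```

Inserting 20 doubles the directory because Bucket A's local depth equals the global depth. A later insert into the full Bucket B, whose local depth is then below the global depth, splits that bucket without doubling.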

19
Extendible Hashing Delete

Delete
If removal of data entry makes bucket empty, can
be merged with split image.
If each directory element points to same bucket
as its split image, can halve directory.
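
The halving condition can be sketched as a check over the directory. In this Python illustration buckets are modeled as plain lists and compared by identity, the split image of directory element i is taken to be element i + half, and the names can_halve and halve are assumptions; the merge of an empty bucket with its split image is not shown.

```python
def can_halve(directory: list) -> bool:
    """True if every element points to the same bucket as its split image."""
    half = len(directory) // 2
    return half >= 1 and all(
        directory[i] is directory[i + half] for i in range(half)
    )


def halve(directory: list) -> list:
    """Drop the second half; it only repeats the pointers of the first half."""
    return directory[: len(directory) // 2]


a, b, c, d, a2 = ["A"], ["B"], ["C"], ["D"], ["A2"]
doubled = [a, b, c, d, a2, b, c, d]   # A and A2 are distinct buckets
merged = [a, b, c, d, a, b, c, d]     # after A2 has been merged back into A
```
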

20
Comments on Extendible Hashing

If the directory fits in memory,
then an equality search is answered with one disk access; else two.
A 100MB file with 100 bytes/rec and 4K pages contains 1,000,000 records (as data entries) and 25,000 directory elements; chances are high that the directory will fit in memory.
Directory grows in spurts.
If the distribution of hash values is skewed,
directory can grow large.
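
The figures in the 100MB example can be reproduced with a few lines of Python arithmetic. The assumptions, which the slide does not state, are that 1MB is taken as 10**6 bytes, that page overhead is ignored, that every page is full, and that there is one directory element per page of data entries.

```python
FILE_BYTES = 100 * 10**6     # 100MB, taking 1MB as 10**6 bytes (assumption)
RECORD_BYTES = 100
PAGE_BYTES = 4 * 1024        # 4K page

records = FILE_BYTES // RECORD_BYTES
records_per_page = PAGE_BYTES // RECORD_BYTES      # whole records per page
directory_elements = records // records_per_page   # one element per full page
```
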

21
Summary

Hash-based indexes best for equality searches,
cannot support range searches.
Static Hashing can lead to long overflow chains.
Extendible Hashing avoids overflow pages by splitting a full bucket when a new data entry is to be added.
A directory keeps track of the buckets and doubles periodically.
