Hash-Based Indexes - PowerPoint PPT Presentation

About This Presentation
Title:

Hash-Based Indexes

Slides: 22
Provided by: RaghuRamak241
Learn more at: http://web.cs.wpi.edu

Transcript and Presenter's Notes

Title: Hash-Based Indexes


1
Hash-Based Indexes
  • Chapter 11

2
Introduction: Hash-Based Indexes
  • Best for equality selections.
  • Cannot support range searches.
  • Static and dynamic hashing techniques exist
  • Trade-offs similar to ISAM vs. B+ trees.

3
Static Hashing
  • h(k) mod N = bucket to which data entry with key
    k belongs. (N = # of buckets)

[Figure: key → h → h(key) mod N selects one of the primary bucket pages 0 ... N-1; overflow pages are chained to the primary bucket pages]
4
Static Hashing: h(k) mod N
  • # of primary pages is fixed (N = # of buckets)
  • allocated sequentially
  • never de-allocated
  • overflow pages if needed.

[Figure: key → h → h(key) mod N selects one of the primary bucket pages 0 ... N-1; overflow pages are chained to the primary bucket pages]
5
Static Hashing
  • h(k) mod N = bucket to which data entry with key
    k belongs, with N = # of buckets
  • Hash function works on search key of record r.
  • h() must distribute values over range 0 ...
    N-1.
  • For example, h(key) = (a * key + b)
  • a and b are constants
  • lots known about how to tune h.
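
A linear hash function of this kind can be sketched in a few lines of Python. The constants a = 31 and b = 7 and the bucket count N = 8 are illustrative assumptions, not values from the slides.

```python
# Illustrative constants (assumptions, not taken from the slides).
A, B = 31, 7


def h(key: int) -> int:
    """Linear hash function h(key) = a * key + b."""
    return A * key + B


def bucket_of(key: int, n_buckets: int) -> int:
    """Static hashing: h(key) mod N selects a bucket in 0 .. N-1."""
    return h(key) % n_buckets


N = 8
buckets = [bucket_of(k, N) for k in range(100)]
```

Every value in `buckets` falls in the range 0 ... N-1; how evenly the keys spread depends on the choice of a, b, and N.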

6
Static Hashing: Cons
  • Primary pages are fixed space → static structure.
  • Fixed # of buckets is the problem
  • Rehashing can be done → Not good for search.
  • In practice, instead use overflow chains.
  • Long overflow chains can develop and degrade
    performance.
  • Solution: Employ dynamic techniques to fix this
    problem
  • Extendible hashing, or
  • Linear Hashing
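
To see how overflow chains degrade search, here is a minimal sketch of a static hash file. The page capacity of 4 entries, the identity hash function, and the class name `StaticHashFile` are assumptions made for illustration only.

```python
PAGE_CAPACITY = 4  # entries per page (assumed for illustration)


class StaticHashFile:
    """Fixed number of buckets; each bucket is a chain of pages."""

    def __init__(self, n_buckets):
        self.n = n_buckets
        # Page 0 of each chain is the primary page; the rest are overflow pages.
        self.buckets = [[[]] for _ in range(n_buckets)]

    def insert(self, key):
        chain = self.buckets[key % self.n]  # h is the identity here (assumption)
        if len(chain[-1]) == PAGE_CAPACITY:
            chain.append([])  # last page is full: add an overflow page
        chain[-1].append(key)

    def search(self, key):
        """Return (found, number of pages read)."""
        pages_read = 0
        for page in self.buckets[key % self.n]:
            pages_read += 1
            if key in page:
                return True, pages_read
        return False, pages_read


f = StaticHashFile(n_buckets=4)
for k in range(0, 80, 4):  # 20 keys that all hash to bucket 0
    f.insert(k)
found, pages_read = f.search(76)
```

With this skewed input, bucket 0 grows to one primary page plus four overflow pages, and finding the last key inserted reads the whole chain, while the other three buckets stay empty.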

7
Extendible Hashing
  • Problem: Bucket (primary page) becomes full.
  • Solution: Why not re-organize file by doubling
    # of buckets?
  • Reading and writing all pages is expensive!
  • Idea: Use directory of pointers to buckets
    instead of buckets
  • double # of buckets by doubling the directory
  • split just the bucket that overflowed!
  • Discussion
  • Directory much smaller than file, so doubling
    is cheaper. Only one page of data entries is
    split.
  • No overflow pages ever.
  • Trick lies in how hash function is adjusted!

8
Example
  • Directory is an array of size 4.
  • To find bucket for r, take last `global depth'
    # bits of h(r)

[Figure: DIRECTORY with GLOBAL DEPTH 2 pointing to four DATA PAGES, each with LOCAL DEPTH 2]
  • 00 → Bucket A: 4, 12, 32, 16
  • 01 → Bucket B: 1, 5, 21, 13
  • 10 → Bucket C: 10
  • 11 → Bucket D: 15, 7, 19
9
Example
  • If h(r) = 5 = binary 101,
  • it is in bucket pointed to by 01.
  • If h(r) = 4 = binary 100,
  • it is in bucket pointed to by 00.

[Figure: same DIRECTORY (GLOBAL DEPTH 2) and DATA PAGES as on the previous slide]
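
The directory lookup can be sketched with a bit mask. The function name `directory_slot` and the list representation of the directory are assumptions for illustration; the bucket labels follow the example.

```python
def directory_slot(hash_value: int, global_depth: int) -> int:
    """Return the last `global_depth` bits of the hash value."""
    return hash_value & ((1 << global_depth) - 1)


GLOBAL_DEPTH = 2
# Directory slots 00, 01, 10, 11 and the buckets they point to.
directory = ["A", "B", "C", "D"]

bucket_for_5 = directory[directory_slot(5, GLOBAL_DEPTH)]  # 5 = binary 101
bucket_for_4 = directory[directory_slot(4, GLOBAL_DEPTH)]  # 4 = binary 100
```

Here `bucket_for_5` is Bucket B (slot 01) and `bucket_for_4` is Bucket A (slot 00), matching the two cases on the slide.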
10
Insertion
  • Insert: If bucket is full, split it (allocate
    new page, re-distribute contents).
  • Splitting may:
  • double the directory, or
  • simply link in a new page.
  • To tell what to do:
  • Compare global depth with local depth for split
    bucket.

[Figure: same DIRECTORY (GLOBAL DEPTH 2) and DATA PAGES as in the Example slides]

11
Insert h(r) = 6 = binary 110
12
Insert h(r) = 6 = binary 110
13
Insert h(r) = 20 = binary 10100
14
Insert h(r) = 20
Split Bucket A into two buckets A1 and A2.

[Figure: Bucket A split into two pages, each with LOCAL DEPTH 3]
  • Bucket A1: 32, 16
  • Bucket A2: 4, 12, 20
15
Insert h(r) = 20

[Figure: state before the split; DIRECTORY with GLOBAL DEPTH 2, every bucket with LOCAL DEPTH 2]
  • 00 → Bucket A: 4, 12, 32, 16
  • 01 → Bucket B: 1, 5, 21, 13
  • 10 → Bucket C: 10
  • 11 → Bucket D: 15, 7, 19
16
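Insert h(r) = 20

[Figure, first panel: Bucket A has been split, DIRECTORY still has GLOBAL DEPTH 2]
  • 00 → Bucket A: 32, 16
  • 01 → Bucket B: 1, 5, 21, 13
  • 10 → Bucket C: 10
  • 11 → Bucket D: 15, 7, 19
  • Bucket A2 (`split image' of Bucket A): 4, 12, 20

[Figure, second panel: DIRECTORY doubled, GLOBAL DEPTH 3]
  • 000 → Bucket A (LOCAL DEPTH 3): 32, 16
  • 001, 101 → Bucket B (LOCAL DEPTH 2): 1, 5, 21, 13
  • 010, 110 → Bucket C (LOCAL DEPTH 2): 10
  • 011, 111 → Bucket D (LOCAL DEPTH 2): 15, 7, 19
  • 100 → Bucket A2 (LOCAL DEPTH 3, `split image' of Bucket A): 4, 12, 20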
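
The insertion of 20 can be replayed with a small in-memory sketch of extendible hashing. This is not the presenter's code: the class names, the bucket capacity of 4 (inferred from Bucket A being full), and storing hash values directly as entries are assumptions made for illustration.

```python
BUCKET_CAPACITY = 4  # assumed; Bucket A is full with four entries


class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.entries = []


class ExtendibleHash:
    def __init__(self, global_depth=2):
        self.global_depth = global_depth
        self.directory = [Bucket(global_depth) for _ in range(1 << global_depth)]

    def _slot(self, h):
        # Last `global_depth` bits of the hash value.
        return h & ((1 << self.global_depth) - 1)

    def insert(self, h):
        bucket = self.directory[self._slot(h)]
        if len(bucket.entries) < BUCKET_CAPACITY:
            bucket.entries.append(h)
            return
        # Full bucket: double the directory only if local depth == global depth.
        if bucket.local_depth == self.global_depth:
            self.directory = self.directory + self.directory  # copy it over
            self.global_depth += 1
        # Split the bucket on the next bit.
        bucket.local_depth += 1
        image = Bucket(bucket.local_depth)  # the split image
        new_bit = 1 << (bucket.local_depth - 1)
        old_entries = bucket.entries
        bucket.entries = [e for e in old_entries if not e & new_bit]
        image.entries = [e for e in old_entries if e & new_bit]
        # Fix the directory pointers that should now reach the split image.
        for i, b in enumerate(self.directory):
            if b is bucket and i & new_bit:
                self.directory[i] = image
        # Retry; assumes at most BUCKET_CAPACITY entries share one hash value.
        self.insert(h)


table = ExtendibleHash(global_depth=2)
for value in [4, 12, 32, 16, 1, 5, 21, 13, 10, 15, 7, 19]:
    table.insert(value)
table.insert(20)  # Bucket A is full and its local depth equals the global depth
```

After the last insert the global depth is 3, slot 000 holds 32 and 16, slot 100 holds 4, 12, and 20, and slots 001 and 101 still share Bucket B.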
17
Points to Note
  • 20 = binary 10100.
  • Last 2 bits (00) tell us whether r belongs in A
    (or not A).
  • Last 3 bits needed to tell A1 from A2
  • Using bits:
  • Global depth of directory: Max # of bits needed
    to tell which bucket an entry belongs to
  • Local depth of a bucket: # of bits used to
    determine if an entry belongs to this bucket.

18
More Points to Note
  • Bits:
  • Global depth of directory: Max # of bits needed
    to tell which bucket an entry belongs to
  • Local depth of a bucket: # of bits used to
    determine if an entry belongs to this bucket.
  • When does bucket split cause directory doubling?
  • Before insert, local depth of bucket = global
    depth.
  • Insert causes local depth to become > global
    depth
  • Directory is doubled by copying it over and
    fixing pointer to split image page.

19
Extendible Hashing: Delete
  • Delete:
  • If removal of data entry makes bucket empty, it
    can be merged with its `split image'.
  • If each directory element points to same bucket
    as its split image, can halve directory.
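
The halving condition can be checked directly on the directory. The sketch below assumes the directory is a list of bucket labels indexed by the last bits of the hash value, so slot i and slot i + half are split images of each other; the function names are hypothetical.

```python
def can_halve(directory):
    """True if every slot points to the same bucket as its split image."""
    half = len(directory) // 2
    return len(directory) > 1 and directory[:half] == directory[half:]


def halve(directory):
    """Drop the upper half of the directory (global depth decreases by 1)."""
    return directory[: len(directory) // 2]


# Global depth 3, Bucket A2 still separate from Bucket A: cannot halve.
before_merge = ["A", "B", "C", "D", "A2", "B", "C", "D"]
# After A2 has been merged back into A: slots 000 and 100 agree again.
after_merge = ["A", "B", "C", "D", "A", "B", "C", "D"]

smaller = halve(after_merge) if can_halve(after_merge) else after_merge
```

In this example `can_halve(before_merge)` is false, while `after_merge` shrinks back to the four-slot directory A, B, C, D.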

20
Comments on Extendible Hashing
  • If directory fits in memory, equality search is
    answered with one disk access; else two.
  • 100MB file, 100 bytes/rec, 4K pages: contains
    1,000,000 records (as data entries) and 25,000
    directory elements; chances are high that
    directory will fit in memory.
  • Directory grows in spurts.
  • If the distribution of hash values is skewed,
    directory can grow large.
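
The figures in the sizing example can be rechecked with a short calculation. It assumes that 100MB means 100 * 10^6 bytes, that a 4K page is 4096 bytes, that pages are full, and that there is one directory element per data page.

```python
file_bytes = 100 * 10**6  # 100MB file (decimal megabytes assumed)
record_bytes = 100        # 100 bytes per record
page_bytes = 4096         # 4K page (assumed to mean 4096 bytes)

records = file_bytes // record_bytes           # number of data entries
records_per_page = page_bytes // record_bytes  # whole records that fit on a page
directory_elements = records // records_per_page  # one element per full page
```

Under these assumptions the file holds 1,000,000 records at 40 records per page, which gives the 25,000 directory elements quoted on the slide.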

21
Summary
  • Hash-based indexes best for equality searches,
    cannot support range searches.
  • Static Hashing can lead to long overflow chains.
  • Extendible Hashing avoids overflow pages by
    splitting a full bucket when a new data entry is
    to be added
  • Directory keeps track of buckets, doubles
    periodically