Chap11. Hashing


Slides: 41

Provided by: dbKon



1
Chap11. Hashing
File Structures by Folk, Zoellick and Riccardi


2
Chapter Objectives

Introduce the concept of hashing
Examine the problem of choosing a good hashing
algorithm, present a reasonable one in detail,
and describe some others
Explore several approaches for reducing
collisions and storage of several records per
address
Develop and use mathematical tools for analyzing
performance differences resulting from the use of
different hashing techniques
Examine problems associated with file
deterioration (record deletions) and discuss some
solutions
Discuss collision resolution techniques
Examine effects of patterns of record access on
performance

3
Contents

11.1 Introduction
11.2 A Simple Hashing Algorithm
11.3 Hashing Functions and Record Distribution
11.4 How Much Extra Memory Should Be Used?
11.5 Collision Resolution by Progressive Overflow
11.6 Storing More Than One Record per Address: Buckets
11.7 Making Deletions
11.8 Other Collision Resolution Techniques
11.9 Patterns of Record Access

4
Overview

O(1) access to files
Record number is obtained by a hashing function H
applied to the primary key, H(key)
Record numbers generated should be uniformly and
randomly distributed such that 0 <= H(key) < N
A hash function is like a black box that produces
an address every time you drop in a key
All parts of the key should be used by the
hashing function H so that a lot of records with
similar keys do not all hash to the same location
Given two random keys X, Y and N slots, the
probability that H(X) = H(Y) is 1/N; in this case, X and
Y are called synonyms and a collision occurs

5
Introduction
11.1 Introduction

Hash function h(k)
Transforms a key K into an address
Hashing vs. other access methods
Sequential search: O(N)
Binary search: O(log2 N)
B-tree (B+ tree) index: O(logk N),
where k is the number of records in an index node
Hash: O(1)

6
A Simple Hashing Scheme (1/2)
11.1 Introduction
[Figure: the key LOWELL is transformed into address 4, LOWELL's home address, where the record is stored]
7
A Simple Hashing Scheme (2/2)
11.1 Introduction
Name     ASCII Code for First Two Letters   Product            Home Address
BALL     66 65                              66 X 65 = 4,290    290
LOWELL   76 79                              76 X 79 = 6,004    004
TREE     84 82                              84 X 82 = 6,888    888
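
The scheme on this slide can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function name simple_hash is hypothetical, and the sketch assumes the home address is the last three digits of the product of the ASCII codes of the first two letters.

```python
def simple_hash(name: str) -> int:
    """Multiply the ASCII codes of the first two letters and
    keep the last three digits of the product as the home address."""
    product = ord(name[0]) * ord(name[1])
    return product % 1000


for name in ("BALL", "LOWELL", "TREE"):
    # Zero-pad to three digits so that 4 is shown as 004.
    print(name, f"{simple_hash(name):03d}")
```

Because only the first two letters are used, any two names that start with the same pair of letters collide, which is one reason a better algorithm uses all parts of the key.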
8
Hashing differs from indexing

With hashing, the addresses generated appear to
be random
No obvious connection between the key and the
location of the corresponding record
So, hashing is sometimes referred to as
randomizing
With hashing, two different keys may be
translated to the same address
Two records may be sent to the same place in the
file

9
Idea behind Hash-based Files
11.1 Introduction

A record with hash key i is stored in node i
All records with hash key h are stored in node h
Primary blocks of data-level nodes are stored
sequentially
The contents of the root node can be expressed by a
simple function: the address of the data-level node for the
record with primary key k is
address of node 0 + H(k)
In literature on hash-based files, primary blocks
of data level nodes are called buckets

10
e.g. Hash-based File
11.1 Introduction
11
Collision (1/2)
11.1 Introduction

Collision
Situation in which a record is hashed to an
address that does not have sufficient room to
store the record
A perfect hashing algorithm (one with no collisions) is nearly impossible to find in practice
Different key, same hash value
(Different record, same address)

12
Collision (2/2)
11.1 Introduction

Solutions
Spread out the records
Find a hashing algorithm that distributes records
more randomly
Use extra memory
It is easier to find a hash algorithm that avoids
collisions if we have only a few records to distribute
among many addresses
Put more than one record at a single address

13
A Simple Hashing Algorithm (1/3)
11.2 A Simple Hashing Algorithm

Step 1. Represent the key in numerical form
If the key is a string take the ASCII code
e.g. LOWELL
76 79 87 69 76 76 32 32 32 32 32 32
L  O  W  E  L  L  (6 blanks)
If the key is a number nothing to be done

14
A Simple Hashing Algorithm (2/3)
11.2 A Simple Hashing Algorithm

Step 2. Fold and Add
Fold
76 79 | 87 69 | 76 76 | 32 32 | 32 32 | 32 32
Add the parts into one integer
Suppose we use a 15-bit integer representation, so 32767
is the limit
7679 + 8769 + 7676 + 3232 + 3232 + 3232 = 33820 > 32767
(overflow!)
Largest addend: 9090 ("ZZ")
Largest allowable result: 32767 - 9090 = 23677, so choose
19937 (a prime)
Ensure that no intermediate sum exceeds the limit by using mod
7679 + 8769 = 16448, 16448 mod 19937 = 16448
16448 + 7676 = 24124, 24124 mod 19937 = 4187
4187 + 3232 = 7419, 7419 mod 19937 = 7419
7419 + 3232 = 10651, 10651 mod 19937 = 10651
10651 + 3232 = 13883, 13883 mod 19937 = 13883
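
The fold-and-add step can be sketched as follows. The function name fold_and_add and its parameters are assumptions made for illustration; the sketch assumes keys are upper-case letters padded with blanks to 12 characters, so every character code has two digits.

```python
def fold_and_add(key: str, width: int = 12, limit: int = 19937) -> int:
    """Fold the key into two-character pieces, add them, and take
    mod `limit` after each addition so no intermediate sum overflows."""
    padded = key.upper().ljust(width)[:width]  # pad with blanks (ASCII 32)
    total = 0
    for i in range(0, width, 2):
        # Two ASCII codes side by side, e.g. "LO" -> 76, 79 -> 7679.
        piece = ord(padded[i]) * 100 + ord(padded[i + 1])
        total = (total + piece) % limit
    return total


print(fold_and_add("LOWELL"))  # 13883
```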

15
A Simple Hashing Algorithm (3/3)
11.2 A Simple Hashing Algorithm

Step 3. Divide by size of the address space
a = s mod n
a: the home address
s: the sum produced in step 2
n: the number of addresses in the file
e.g., a = 13883 mod 100 = 83
A prime number is usually used for the divisor
because primes tend to distribute remainders much
more uniformly than do nonprimes
So we choose a prime number as close as possible
to the desired size of the address space
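
One way to pick such a divisor is sketched below. The helper names is_prime and nearest_prime are hypothetical, and trial division is used only because address spaces in this example are small.

```python
def is_prime(n: int) -> bool:
    """Trial division; adequate for small address-space sizes."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def nearest_prime(target: int) -> int:
    """Return the prime closest to `target`; a tie goes to the larger prime."""
    offset = 0
    while True:
        if is_prime(target + offset):
            return target + offset
        if is_prime(target - offset):
            return target - offset
        offset += 1


n = nearest_prime(100)   # desired size 100 -> divisor 101
print(n, 13883 % n)      # 101 46
```

With the desired size of 100 addresses, the sum 13883 from step 2 maps to home address 46 instead of 83.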

16
Hashing Functions and Record Distributions
11.3 Hashing Functions and Record Distributions

Distributing records among addresses

Uniform distribution
17
Some other hashing methods
11.3 Hashing Functions and Record Distributions

Better-than-random
Examine keys for a pattern
Fold parts of the key
Divide the key by a number
When the better-than-random methods do not work,
randomize!
Square the key and take the middle digits
Radix transformation (change the base of the key)
453 (base 10) -> 382 (base 11)
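
The two randomizing methods can be sketched as below. The function names and the choice of two middle digits are assumptions for illustration; in both cases the result would still be divided by the address-space size to obtain a home address.

```python
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def mid_square(key: int, ndigits: int = 2) -> int:
    """Square the key and extract `ndigits` digits from the middle."""
    squared = str(key * key)
    start = (len(squared) - ndigits) // 2
    return int(squared[start:start + ndigits])


def radix_transform(key: int, base: int = 11) -> str:
    """Rewrite the key in another base (digit 10 is written as 'A')."""
    if key == 0:
        return "0"
    out = []
    while key > 0:
        key, digit = divmod(key, base)
        out.append(DIGITS[digit])
    return "".join(reversed(out))


print(mid_square(453))       # 453 * 453 = 205209 -> 52
print(radix_transform(453))  # 382
```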

18
How Much Extra Memory Should Be Used?
11.4 How Much Extra Memory Should Be Used?

The more records are packed, the more likely a
collision will occur

19
11.4 How Much Extra Memory Should Be Used?
Poisson Distribution
p(x) = (r/N)^x e^(-r/N) / x!
p(x): the probability that a given address will
have x records assigned to it after the
hashing function has been applied to all r
records
N: the number of available addresses
r: the number of records to be stored
x: the number of records assigned to a given address
20
Predicting Collisions for Different Packing
Densities
11.4 How Much Extra Memory Should Be Used?

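Number of addresses with no record assigned: N X p(0)
Number of addresses with one record assigned: N X p(1)
Number of addresses with two or more records assigned:
N X [p(2) + p(3) + p(4) + ...]
Number of overflow records: 1 X N p(2) + 2 X N p(3) + 3 X N p(4) + ...
Percentage of overflow records: (number of overflow records / r) X 100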
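
These predictions are easy to evaluate numerically. The sketch below assumes the Poisson approximation with mean r/N and one record per address; the names poisson and predict, and the cut-off max_x for the infinite sum, are choices made for this illustration.

```python
import math


def poisson(x: int, r: int, n: int) -> float:
    """Probability that a given address receives exactly x of the r records."""
    mean = r / n
    return mean ** x * math.exp(-mean) / math.factorial(x)


def predict(r: int, n: int, max_x: int = 50) -> dict:
    """Expected address counts and overflow for r records in n addresses."""
    # An address holding x >= 2 records contributes x - 1 overflow records.
    overflow = sum((x - 1) * n * poisson(x, r, n) for x in range(2, max_x))
    return {
        "empty": n * poisson(0, r, n),
        "one_record": n * poisson(1, r, n),
        "two_or_more": n * (1 - poisson(0, r, n) - poisson(1, r, n)),
        "overflow_records": overflow,
        "overflow_pct": 100 * overflow / r,
    }


# 500 records in 1000 addresses (packing density 50%).
for name, value in predict(500, 1000).items():
    print(f"{name}: {value:.1f}")
```

At 50% packing density, roughly one record in five is predicted to be an overflow record.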

21
The larger the space, the fewer the overflows
11.4 How Much Extra Memory Should Be Used?
Packing density = r/N (N addresses, r
records)
22
Collision Resolution by Progressive Overflow
11.5 Collision Resolution by Progressive Overflow

Progressive overflow (linear probing)
Inserting a new record
1. Use the home address if it is empty
2. Otherwise, search the following addresses
in sequence until an empty one is found
3. If the end of the address space is reached, wrap around to the beginning
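
The three insertion steps can be sketched as follows. The file is modelled as an in-memory list, and key % n stands in for the hash function; both are assumptions for illustration only.

```python
def insert(table: list, key: int) -> int:
    """Insert `key` by progressive overflow and return the slot used."""
    n = len(table)
    home = key % n                    # stand-in hash function (assumption)
    for step in range(n):
        slot = (home + step) % n      # the modulo wraps around at the end
        if table[slot] is None:
            table[slot] = key
            return slot
    raise OverflowError("no empty address left in the file")


table = [None] * 7
for key in (20, 27, 13, 6):           # all four keys have home address 6
    print(key, "->", insert(table, key))
```

All four keys are synonyms, so they land in slots 6, 0, 1 and 2: each new record is pushed one address further from home.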

23
11.5 Collision Resolution by Progressive Overflow
Progressive Overflow (1/5)
24
11.5 Collision Resolution by Progressive Overflow
Progressive Overflow (2/5)
25
Progressive Overflow (3/5)
11.5 Collision Resolution by Progressive Overflow

Searching for a record with hash function value k:
starting from home address k, look at successive records
until the record is found,
or an open address is encountered
Worst case:
the record does not exist and the file is
full
Why overflow should be avoided:
extra searches have to occur when a record is not
found at its home address
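
A search under progressive overflow can be sketched as below, again with an in-memory list and key % n as a stand-in hash function (both assumptions). The loop bound covers the worst case, in which the file is full and every address must be examined.

```python
def search(table: list, key: int):
    """Return the slot holding `key`, or None if it is not in the file."""
    n = len(table)
    home = key % n                    # stand-in hash function (assumption)
    for step in range(n):
        slot = (home + step) % n
        if table[slot] is None:       # open address: the key cannot be further on
            return None
        if table[slot] == key:
            return slot
    return None                       # full file and key absent: n probes made


table = [27, 13, None, None, None, None, 20]
print(search(table, 13))   # found after probing slots 6, 0, 1
print(search(table, 34))   # stops at the open address in slot 2
```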

26
Progressive Overflow (4/5)
11.5 Collision Resolution by Progressive Overflow

Search length: the number of accesses required to
retrieve a record (from secondary memory)

27
Progressive Overflow (5/5)
11.5 Collision Resolution by Progressive Overflow

With a perfect hashing function, the average search
length is 1
Average search lengths of no greater than 2.0 are
generally considered acceptable
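
For a file built by progressive overflow, the average search length can be computed directly from where each record ended up. The sketch below assumes an in-memory list, key % n as a stand-in hash function, and no deletions, so a record's search length is its distance from home plus one.

```python
def average_search_length(table: list) -> float:
    """Average number of accesses needed to retrieve each stored record."""
    n = len(table)
    lengths = [((slot - (key % n)) % n) + 1   # probes from home to the record
               for slot, key in enumerate(table) if key is not None]
    return sum(lengths) / len(lengths) if lengths else 0.0


# 20, 27, 13 and 6 all have home address 6 in a 7-slot file.
table = [27, 13, 6, None, None, None, 20]
print(average_search_length(table))   # (2 + 3 + 4 + 1) / 4 = 2.5
```

Four synonyms in a row already push the average past the 2.0 guideline, even though the file is barely half full.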

28
Storing More Than One Record per Address: Buckets
11.6 Storing More Than One Record per Address:
Buckets

Bucket: a block of records sharing the same
address (on a block-addressing disk)

29
Effects of Buckets on Performance
11.6 Storing More Than One Record per Address
Buckets

Number of overflow records =
N X [1 X p(b+1) + 2 X p(b+2) + 3 X p(b+3) + ...]
N: number of addresses
b: number of records that fit in a bucket
bN: number of available locations for records
Packing density = r/bN
For a given packing density, performance continues
to improve as the bucket size gets larger
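
The effect of bucket size can be checked numerically. The sketch below assumes the Poisson approximation with mean r/N; the function name overflow_pct and the cut-off max_x are choices made for this illustration, and the figures compare two files with the same packing density of 75%.

```python
import math


def overflow_pct(r: int, n: int, b: int, max_x: int = 60) -> float:
    """Predicted percentage of the r records that overflow when
    n addresses each hold a bucket of b records."""
    mean = r / n

    def p(x: int) -> float:
        return mean ** x * math.exp(-mean) / math.factorial(x)

    # A bucket assigned x > b records overflows by x - b records.
    overflow = n * sum((x - b) * p(x) for x in range(b + 1, max_x))
    return 100 * overflow / r


# 750 records, 1000 record locations in both cases.
print(f"b = 1: {overflow_pct(750, 1000, 1):.1f}%")
print(f"b = 2: {overflow_pct(750, 500, 2):.1f}%")
```

Halving the number of addresses while doubling the bucket size keeps the packing density fixed but lowers the predicted overflow noticeably.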

30
Bucket Implementation
11.6 Storing More Than One Record per Address
Buckets
31
Bucket Implementation (Cont'd)
11.6 Storing More Than One Record per Address
Buckets

Initializing and Loading
Creating empty space
Use hash values and find the bucket to store
If the home bucket is full, continue to look at
successive buckets
Problems when
No empty space exists
Duplicate keys occur

32
Making Deletions
11.7 Making Deletions

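The slot freed by a deletion can disrupt later searches:
a search that stops at the first empty slot misses records stored beyond it
Use tombstones to mark deleted slots, and reuse the freed slots on later insertions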
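
A minimal sketch of tombstones with progressive overflow follows. The TOMBSTONE marker, the in-memory list and the key % n hash are all assumptions for illustration; the insert shown assumes the caller has already checked that the key is not in the file.

```python
TOMBSTONE = object()   # marks a slot whose record was deleted


def find(table: list, key: int):
    """Search past tombstones; stop only at a truly empty slot."""
    n = len(table)
    for step in range(n):
        slot = (key % n + step) % n
        if table[slot] is None:
            return None
        if table[slot] is not TOMBSTONE and table[slot] == key:
            return slot
    return None


def delete(table: list, key: int) -> bool:
    slot = find(table, key)
    if slot is None:
        return False
    table[slot] = TOMBSTONE            # do not reset to None
    return True


def insert(table: list, key: int) -> int:
    """Reuse the first empty or tombstoned slot on the probe path."""
    n = len(table)
    for step in range(n):
        slot = (key % n + step) % n
        if table[slot] is None or table[slot] is TOMBSTONE:
            table[slot] = key
            return slot
    raise OverflowError("no free slot")


table = [27, 13, None, None, None, None, 20]   # 20, 27, 13 share home 6
delete(table, 27)
print(find(table, 13))     # 1: the search continues past the tombstone
print(insert(table, 34))   # 0: the freed slot is reused
```

Had the deleted slot simply been emptied, the search for 13 would have stopped at slot 0 and reported the record missing.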

33
Other Collision Resolution Techniques
11.8 Other Collision Resolution Techniques

Double hashing: avoid clustering by using a second
hash function for overflow records
Chained progressive overflow: each home address
contains a pointer to the next record with the same
home address
Chaining with a separate overflow area: move all
overflow records to a separate overflow area
Scatter tables: the hash file contains only pointers
to records (like indexing)

34
Linear Probing (1/2)
11.8 Other Collision Resolution Techniques

When a synonym is identified, search forward from
the address given by the hash function (the
natural address) until an empty slot is located,
and store this record there
This is an example of open addressing (examining
a predictable sequence of slots for an empty one)

35
Linear Probing (2/2)
11.8 Other Collision Resolution Techniques

insertion sequence (key):  A  S  E  A  R  C  H  I  N  G  E  X  A  M  P  L  E
hash:                      1  0  5  1 18  3  8  9 14  7  5  5  1 13 16 12  5

[Figure: memory space, slots 0 to 18, shown after each insertion]
36
Rehashing (1/2)
11.8 Other Collision Resolution Techniques

In linear probing, when a synonym occurs, the
address is incremented by 1 and the next location is searched
In rehashing, a second hash function gives the
displacement
This method has the advantage of avoiding
congestion, because each synonym under the first
hash function likely uses a different
displacement D, and thus examines a different
sequence of slots
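
Rehashing with a second hash function can be sketched as below. Both hash functions here are hypothetical stand-ins, not the ones used in the slides; the sketch assumes a prime table size so that every displacement from 1 to n - 1 eventually visits all slots.

```python
def h1(key: int, n: int) -> int:
    """First hash function: the home address (assumed form)."""
    return key % n


def h2(key: int, n: int) -> int:
    """Second hash function: the displacement D, never 0 (assumed form)."""
    return 1 + key % (n - 1)


def insert(table: list, key: int) -> int:
    n = len(table)                    # assumed to be prime
    slot, displacement = h1(key, n), h2(key, n)
    for _ in range(n):
        if table[slot] is None:
            table[slot] = key
            return slot
        slot = (slot + displacement) % n
    raise OverflowError("no empty slot found")


table = [None] * 7
for key in (20, 27, 13):              # all three have home address 6
    print(key, "->", insert(table, key))
```

The synonyms 27 and 13 use displacements 4 and 2, so they end up in slots 3 and 1 rather than queuing up behind slot 6 as they would under linear probing.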

37
Rehashing (where P = 3) (2/2)
11.8 Other Collision Resolution Techniques

insertion sequence (key):  A  S  E  A  R  C  H  I  N  G  E  X  A  M  P  L  E
hash:                      1  0  5  1 18  3  8  9 14  7  5  5  1 13 16 12  5

[Figure: memory space, slots 0 to 18, shown after each insertion with rehashing]
38
Chained progressive overflow
39
Overflow File
11.8 Other Collision Resolution Techniques

When building the file, if a collision occurs,
place the new synonym into a separate area of the
file called the overflow section

40
scatter table