Title: Chap11. Hashing
1Chap11. Hashing
File Structures by Folk, Zoellick and Riccardi
2Chapter Objectives
- Introduce the concept of hashing
- Examine the problem of choosing a good hashing algorithm, present a reasonable one in detail, and describe some others
- Explore several approaches for reducing collisions, including storage of several records per address
- Develop and use mathematical tools for analyzing performance differences resulting from the use of different hashing techniques
- Examine problems associated with file deterioration (record deletions) and discuss some solutions
- Discuss collision resolution techniques
- Examine effects of patterns of record access on performance
3Contents
- 11.1 Introduction
- 11.2 A Simple Hashing Algorithm
- 11.3 Hashing Functions and Record Distribution
- 11.4 How Much Extra Memory Should Be Used?
- 11.5 Collision Resolution by Progressive Overflow
- 11.6 Storing More Than One Record per Address: Buckets
- 11.7 Making Deletions
- 11.8 Other Collision Resolution Techniques
- 11.9 Patterns of Record Access
4Overview
- O(1) access to files
- The record number is obtained by applying a hashing function H to the primary key: H(key)
- The record numbers generated should be uniformly and randomly distributed such that 0 <= H(key) < N
- A hash function is like a black box that produces an address every time you drop in a key
- All parts of the key should be used by the hashing function H, so that a lot of records with similar keys do not all hash to the same location
- Given two random keys X and Y and N slots, the probability that H(X) = H(Y) is 1/N; in this case, X and Y are called synonyms and a collision occurs
5 Introduction
11.1 Introduction
- Hash function h(K)
  - Transforms a key K into an address
- Hash vs. other index structures (search cost for N records)
  - Sequential search: O(N)
  - Binary search: O(log_2 N)
  - B-tree (B+ tree) index: O(log_k N), where k is the number of records in an index node
  - Hash: O(1)
6A Simple Hashing Scheme (1/2)
[Figure: the key LOWELL is fed to the hash function, which produces address 4; the record is stored at address 4, LOWELL's home address.]
7A Simple Hashing Scheme (2/2)
Name     ASCII code for first two letters   Product            Home address
BALL     66 65                              66 × 65 = 4,290    290
LOWELL   76 79                              76 × 79 = 6,004    004
TREE     84 82                              84 × 82 = 6,888    888
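The scheme on this slide can be sketched in a few lines of Python. This is only an illustration: the function name simple_hash and the default of 1,000 addresses (implied by keeping the last three digits of the product) are assumptions, not part of the original slides.

```python
def simple_hash(key: str, num_addresses: int = 1000) -> int:
    """Multiply the ASCII codes of the first two letters of the key and
    keep the last three digits of the product (product mod 1000)."""
    product = ord(key[0]) * ord(key[1])
    return product % num_addresses


for name in ("BALL", "LOWELL", "TREE"):
    # Home addresses are printed as three digits, as on the slide.
    print(f"{name:8s}{simple_hash(name):03d}")
```

Because only the first two letters are used, any two names that start with the same two letters become synonyms, which is one reason the later slides use all parts of the key.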
8Hashing differs from indexing
- With hashing, the addresses generated appear to be random
  - There is no obvious connection between the key and the location of the corresponding record
  - So hashing is sometimes referred to as randomizing
- With hashing, two different keys may be translated to the same address
  - Two records may be sent to the same place in the file
9Idea behind Hash-based Files
- The record with hash key i is stored in node i
  - All records with hash key h are stored in node h
- Primary blocks of data-level nodes are stored sequentially
- The contents of the root node can be expressed by a simple function:
  address of the data-level node for the record with primary key k = address of node 0 + H(k)
- In the literature on hash-based files, primary blocks of data-level nodes are called buckets
10e.g. Hash-based File
11Collision (1/2)
- Collision
  - A situation in which a record is hashed to an address that does not have sufficient room to store the record
  - A perfect hashing algorithm (one that never produces a collision) is practically impossible to find
  - Different keys, same hash value
  - (Different records, same address)
12Collision (2/2)
- Solutions
  - Spread out the records
    - Find a hashing algorithm that distributes the records more randomly
  - Use extra memory
    - It is easier to find a hash algorithm that avoids collisions if we have only a few records to distribute among many addresses
  - Put more than one record at a single address
13A Simple Hashing Algorithm (1/3)
11.2 A Simple Hashing Algorithm
- Step 1. Represent the key in numerical form
  - If the key is a string, take the ASCII code of each character
  - e.g. LOWELL, padded with 6 blanks to 12 characters
      76 79 87 69 76 76 32 32 32 32 32 32
      L  O  W  E  L  L  (6 blanks)
  - If the key is a number, nothing needs to be done
14A Simple Hashing Algorithm (2/3)
- Step 2. Fold and Add
  - Fold: group the codes in pairs
      76 79 | 87 69 | 76 76 | 32 32 | 32 32 | 32 32
  - Add the parts into one integer
    - Suppose we use a 15-bit integer representation, so 32767 is the limit
    - 7679 + 8769 + 7676 + 3232 + 3232 + 3232 = 33820 > 32767 (overflow!)
    - Largest addend: 9090 (ZZ)
    - Largest allowable intermediate result: 32767 - 9090 = 23677, so we use 19937, a prime below that bound
  - Ensure that no intermediate sum exceeds the limit by applying mod 19937 after each addition
    - 7679 + 8769 = 16448 -> 16448 mod 19937 = 16448
    - 16448 + 7676 = 24124 -> 24124 mod 19937 = 4187
    - 4187 + 3232 = 7419 -> 7419 mod 19937 = 7419
    - 7419 + 3232 = 10651 -> 10651 mod 19937 = 10651
    - 10651 + 3232 = 13883 -> 13883 mod 19937 = 13883
15A Simple Hashing Algorithm (3/3)
- Step 3. Divide by the size of the address space
  - a = s mod n
    - a: home address
    - s: the sum produced in step 2
    - n: the number of addresses in the file
  - e.g. a = 13883 mod 100 = 83
  - A prime number is usually used for the divisor because primes tend to distribute remainders much more uniformly than nonprimes do
  - So we choose a prime number as close as possible to the desired size of the address space
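The three steps can be put together as in the following Python sketch. The function name fold_and_add_hash and its parameters are chosen here for illustration; the 12-character key length and the modulus 19937 follow the LOWELL example, and the sketch assumes keys whose ASCII codes have two digits (such as upper-case letters and blanks).

```python
def fold_and_add_hash(key: str, num_addresses: int,
                      key_length: int = 12, modulus: int = 19937) -> int:
    """Hash a string key with the fold-and-add steps (key_length must be even)."""
    # Step 1: pad the key with blanks to a fixed length and take ASCII codes.
    padded = key.ljust(key_length)[:key_length]
    codes = [ord(ch) for ch in padded]

    # Step 2: fold the codes into pairs (76, 79 -> 7679) and add the pairs,
    # reducing mod `modulus` after each addition so that no intermediate
    # sum can exceed the 15-bit limit.
    total = 0
    for i in range(0, key_length, 2):
        pair = codes[i] * 100 + codes[i + 1]
        total = (total + pair) % modulus

    # Step 3: divide by the size of the address space, keep the remainder.
    return total % num_addresses


print(fold_and_add_hash("LOWELL", 100))  # 13883 mod 100
print(fold_and_add_hash("LOWELL", 101))  # same sum, prime divisor
```

Python integers do not overflow, so the intermediate mod is kept only to mirror the slide; in a language with fixed-width integers it is what prevents the overflow.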
16 Hashing Functions and Record Distributions
11.3 Hashing Functions and Record Distributions
- Distributing records among addresses
- Uniform distribution: the records are spread evenly among the addresses
17Some other hashing methods
- Better-than-random
  - Examine keys for a pattern
  - Fold parts of the key
  - Divide the key by a number
- When the better-than-random methods do not work, randomize!
  - Square the key and take the middle digits
  - Radix transformation
    - 453 (base 10) -> 382 (base 11)
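The two randomizing methods can be sketched as follows. The function names, the number of middle digits, and the way the base-11 digits are read back as a decimal number are assumptions made for illustration.

```python
def mid_square(key: int, num_digits: int = 2) -> int:
    """Square the key and return the middle `num_digits` digits."""
    squared = str(key * key)
    start = max((len(squared) - num_digits) // 2, 0)
    return int(squared[start:start + num_digits])


def radix_transform(key: int, base: int = 11) -> int:
    """Rewrite the key in another base and read its digits as a decimal number."""
    digits = []
    while key > 0:
        key, digit = divmod(key, base)
        digits.append(digit)
    digits.reverse()
    value = 0
    for digit in digits:
        value = value * 10 + digit
    return value


print(mid_square(453))        # 453 * 453 = 205209, middle two digits
print(radix_transform(453))   # 453 in base 11 has digits 3, 8, 2
```

Either result would then be divided by the number of addresses, with the remainder used as the home address.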
18How Much Extra Memory Should Be Used?
11.4 How Much Extra Memory Should Be Used?
- The more densely the records are packed, the more likely it is that a collision will occur
19 Poisson Distribution
$p(x) = \frac{(r/N)^x \, e^{-(r/N)}}{x!}$
- p(x): the probability that a given address will have x records assigned to it after the hashing function has been applied to all r records
- N: the number of available addresses
- r: the number of records to be stored
- x: the number of records assigned to a given address
20Predicting Collisions for Different Packing Densities
- Number of addresses with no record assigned: N × p(0)
- Number of addresses with exactly one record assigned: N × p(1)
- Number of addresses with two or more records assigned: N × [p(2) + p(3) + p(4) + ...]
- Number of overflow records: 1 × N p(2) + 2 × N p(3) + 3 × N p(4) + ...
- Percentage of overflow records: (number of overflow records / r) × 100
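These predictions can be computed directly. The sketch below assumes the standard Poisson form p(x) = (r/N)^x e^(-r/N) / x! and one record per address; the function names and the cutoff max_x that truncates the infinite sums are choices made here for illustration.

```python
import math


def poisson(x: int, r: int, n: int) -> float:
    """p(x): probability that a given address is assigned exactly x records."""
    ratio = r / n
    return (ratio ** x) * math.exp(-ratio) / math.factorial(x)


def predict(r: int, n: int, max_x: int = 50) -> dict:
    """Expected address counts and overflow for r records in n addresses."""
    empty = n * poisson(0, r, n)
    single = n * poisson(1, r, n)
    multiple = n * sum(poisson(x, r, n) for x in range(2, max_x))
    # An address holding x records overflows x - 1 of them.
    overflow = n * sum((x - 1) * poisson(x, r, n) for x in range(2, max_x))
    return {
        "empty": empty,
        "single": single,
        "multiple": multiple,
        "overflow_records": overflow,
        "overflow_percent": 100 * overflow / r,
    }


for key, value in predict(r=10000, n=10000).items():
    print(f"{key}: {value:.1f}")
```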
21The larger the address space, the fewer the overflows
Packing density = r/N (N addresses, r records)
22Collision Resolution by Progressive Overflow
11.5 Collision Resolution by Progressive Overflow
- Progressive overflow (linear probing)
- Inserting a new record
  - 1. Use the home address if it is empty
  - 2. Otherwise, search the following addresses in sequence until an empty one is found
  - 3. If the end of the address space is reached, wrap around to the first address and continue
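A minimal sketch of the insertion steps, assuming the file is modeled as a Python list in which None marks an empty address and the home address has already been computed by some hash function. The function name is hypothetical.

```python
from typing import List, Optional


def insert_progressive(table: List[Optional[str]], key: str, home: int) -> int:
    """Store key at its home address or at the next empty address.

    Returns the address actually used.
    """
    n = len(table)
    for step in range(n):
        address = (home + step) % n   # wrap around past the last address
        if table[address] is None:
            table[address] = key
            return address
    raise RuntimeError("file is full")


table: List[Optional[str]] = [None] * 5
print(insert_progressive(table, "A", 3))  # home address is empty
print(insert_progressive(table, "B", 3))  # collision: next address
print(insert_progressive(table, "C", 3))  # collision: wraps to address 0
```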
23 Progressive Overflow (1/5)
24 Progressive Overflow (2/5)
25Progressive Overflow (3/5)
- Searching for a record whose hash function value is k
  - Starting from home address k, look at successive records until
    - the record is found, or
    - an open address is encountered
- Worst case
  - The record does not exist and the file is full
- The reason to avoid overflow
  - Extra searches have to occur when a record is not found at its home address
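The search can be sketched in the same style, again assuming a list with None for open addresses and a home address supplied by the caller. The function name is hypothetical.

```python
from typing import List, Optional


def search_progressive(table: List[Optional[str]], key: str,
                       home: int) -> Optional[int]:
    """Return the address of key, or None if it is not in the file."""
    n = len(table)
    for step in range(n):
        address = (home + step) % n
        if table[address] is None:    # open address: the key is not stored
            return None
        if table[address] == key:     # found
            return address
    # Worst case: every address in a full file was examined without success.
    return None


table = ["C", None, None, "A", "B"]       # A, B and C all have home address 3
print(search_progressive(table, "C", 3))  # found after wrapping around
print(search_progressive(table, "D", 3))  # stops at the open address 1
```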
26Progressive Overflow (4/5)
- Search length: the number of accesses required to retrieve a record (from secondary memory)
27Progressive Overflow (5/5)
- With a perfect hashing function: average search length = 1
- An average search length of no greater than 2.0 is generally considered acceptable
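Average search length is the total search length divided by the number of records. The sketch below computes it for records placed by progressive overflow; the five (home address, actual address) pairs and the file size of 100 are made-up illustrative data, and the function names are hypothetical.

```python
from typing import List, Tuple


def search_length(home: int, actual: int, num_addresses: int) -> int:
    """Accesses needed to reach a record stored at `actual` when the
    search starts at `home` (with wrap-around)."""
    return (actual - home) % num_addresses + 1


def average_search_length(placements: List[Tuple[int, int]],
                          num_addresses: int) -> float:
    total = sum(search_length(home, actual, num_addresses)
                for home, actual in placements)
    return total / len(placements)


# (home address, actual address) for five records; three of them overflowed.
placements = [(20, 20), (21, 21), (21, 22), (22, 23), (20, 24)]
print(average_search_length(placements, 100))  # (1 + 1 + 2 + 2 + 5) / 5
```

In this example the average exceeds 2.0, so by the rule of thumb above the file would be considered too crowded around these addresses.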
28Storing More Than One Record per Address: Buckets
11.6 Storing More Than One Record per Address: Buckets
- Bucket: a block of records sharing the same address (on a block-addressing disk)
29Effects of Buckets on Performance
- Number of overflow records: N × [1 × p(b+1) + 2 × p(b+2) + 3 × p(b+3) + ...]
  - N: number of addresses
  - b: number of records that fit in a bucket
  - bN: number of available locations for records
  - Packing density = r/bN
- As the bucket size gets larger, performance continues to improve
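The overflow formula can be evaluated numerically to compare bucket sizes at the same packing density. The sketch assumes the Poisson form p(x) = (r/N)^x e^(-r/N) / x!; the function names, the example figures (750 records at 75% packing density), and the cutoff max_x for the infinite sum are choices made here for illustration.

```python
import math


def poisson(x: int, ratio: float) -> float:
    """p(x) for ratio = r/N, computed with logarithms to avoid large powers."""
    return math.exp(x * math.log(ratio) - ratio - math.lgamma(x + 1))


def overflow_percent(r: int, n: int, b: int, max_x: int = 200) -> float:
    """Percentage of r records that overflow from n buckets of size b."""
    ratio = r / n
    # A bucket that is assigned x > b records overflows x - b of them.
    overflow = n * sum((x - b) * poisson(x, ratio)
                       for x in range(b + 1, max_x))
    return 100 * overflow / r


# Same 750 records and the same packing density r/bN = 0.75 in both cases.
print(round(overflow_percent(r=750, n=1000, b=1), 1))
print(round(overflow_percent(r=750, n=500, b=2), 1))
```

The second call uses half as many addresses with twice the bucket size, so the storage used is identical and only the overflow rate changes.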
30Bucket Implementation
31Bucket Implementation (Cont'd)
- Initializing and loading
  - Create empty space
  - Use hash values to find the bucket in which to store each record
  - If the home bucket is full, continue to look at successive buckets
- Problems arise when
  - No empty space exists
  - Duplicate keys occur
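A minimal sketch of loading a bucket file, including the two problem cases. It assumes each bucket is modeled as a Python list, that the home bucket number is supplied by the caller, and that no records have been deleted; the function name and the choice of exceptions are hypothetical.

```python
from typing import List


def insert_into_buckets(buckets: List[List[str]], key: str,
                        home: int, bucket_size: int) -> int:
    """Store key in its home bucket or in the next bucket with room.

    Returns the number of the bucket used.
    """
    n = len(buckets)
    for step in range(n):
        number = (home + step) % n
        bucket = buckets[number]
        if key in bucket:
            raise KeyError(f"duplicate key: {key}")
        if len(bucket) < bucket_size:
            bucket.append(key)
            return number
    raise RuntimeError("no empty space exists")


# Initializing: create the empty space first, then load the records.
buckets: List[List[str]] = [[] for _ in range(3)]
for key in ("a", "b", "c"):
    print(key, insert_into_buckets(buckets, key, home=0, bucket_size=2))
```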
32Making Deletions
11.7 Making Deletions
- The slot freed by a deletion can disrupt later searches, because a search stops when it reaches an empty slot
- Use tombstones to mark deleted slots, and reuse the freed slots
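The following sketch shows how a tombstone keeps searches working after a deletion under progressive overflow. The marker value, the function names, and the list-based file model are assumptions; the insert routine also assumes the key is not already in the file.

```python
from typing import List, Optional

TOMBSTONE = "<deleted>"  # hypothetical marker, assumed never to be a real key


def find(table: List[Optional[str]], key: str, home: int) -> Optional[int]:
    """Search from the home address; tombstones do not stop the search."""
    n = len(table)
    for step in range(n):
        address = (home + step) % n
        if table[address] is None:     # truly empty slot: key is absent
            return None
        if table[address] == key:
            return address
    return None


def delete(table: List[Optional[str]], key: str, home: int) -> bool:
    """Replace the record with a tombstone instead of emptying the slot."""
    address = find(table, key, home)
    if address is None:
        return False
    table[address] = TOMBSTONE
    return True


def insert(table: List[Optional[str]], key: str, home: int) -> int:
    """Reuse the first empty or tombstoned slot at or after the home address."""
    n = len(table)
    for step in range(n):
        address = (home + step) % n
        if table[address] is None or table[address] == TOMBSTONE:
            table[address] = key
            return address
    raise RuntimeError("file is full")
```

If the deleted slot were simply set to empty, a search for a record stored beyond it would stop early and report the record as missing.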
33Other Collision Resolution Techniques
11.8 Other Collision Resolution Techniques
- Double hashing: avoid clustering by using a second hash function for overflow records
- Chained progressive overflow: each home address contains a pointer to the next record with the same home address
- Chaining with a separate overflow area: move all overflow records to a separate overflow area
- Scatter tables: the hash file contains only pointers to records (like indexing)
34Linear Probing (1/2)
- When a synonym is identified, search forward from the address given by the hash function (the natural address) until an empty slot is located, and store the record there
- This is an example of open addressing (examining a predictable sequence of slots for an empty one)
35Linear Probing (2/2)
key:   A  S  E  A  R  C  H  I  N  G  E  X  A  M  P  L  E
hash:  1  0  5  1 18  3  8  9 14  7  5  5  1 13 16 12  5
[Figure: contents of the memory space (addresses 0 to 18) after each key of the insertion sequence is stored using linear probing.]
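The example can be replayed in code. The sketch assumes the keys are inserted in the order A S E A R C H I N G E X A M P L E and that the hash value of a letter is its position in the alphabet mod 19, which reproduces the hash values listed on the slide; the function names are hypothetical.

```python
from typing import List, Optional


def letter_hash(key: str, table_size: int = 19) -> int:
    """Position of the letter in the alphabet (A = 1), mod the table size."""
    return (ord(key) - ord("A") + 1) % table_size


def build_table(keys: str, table_size: int = 19) -> List[Optional[str]]:
    """Insert each key with linear probing and return the final table."""
    table: List[Optional[str]] = [None] * table_size
    for key in keys:
        home = letter_hash(key, table_size)
        for step in range(table_size):
            address = (home + step) % table_size
            if table[address] is None:
                table[address] = key
                break
        else:
            raise RuntimeError("table is full")
    return table


table = build_table("ASEARCHINGEXAMPLE")
# One character per address 0..18; "." marks an empty slot.
print("".join(slot if slot else "." for slot in table))
```

The last E probes seven addresses (5 through 11) before it finds an empty slot, which illustrates the clustering that linear probing produces.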
36Rehashing (1/2)
- In linear probing, when a synonym occurs, the address r is incremented by 1 and the next location is searched
- In rehashing, a second hash function is used to compute the displacement
- This method has the advantage of avoiding congestion, because each synonym under the first hash function is likely to use a different displacement D and thus examines a different sequence of slots
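A minimal sketch of rehashing with a second hash function. The two hash functions shown (key mod 19 and 1 + key mod 17) and the table size of 19 are hypothetical choices for illustration, not the ones used in the slides; a prime table size guarantees that every non-zero displacement eventually visits all slots.

```python
from typing import Callable, List, Optional


def insert_rehash(table: List[Optional[int]], key: int,
                  h1: Callable[[int], int],
                  h2: Callable[[int], int]) -> int:
    """Insert key, stepping by the displacement h2(key) after a collision."""
    n = len(table)
    address = h1(key) % n
    displacement = h2(key)            # D: must not be a multiple of n
    for _ in range(n):
        if table[address] is None:
            table[address] = key
            return address
        address = (address + displacement) % n
    raise RuntimeError("no empty slot found")


def first_hash(key: int) -> int:      # hypothetical first hash function
    return key % 19


def second_hash(key: int) -> int:     # hypothetical second hash function
    return 1 + key % 17


table: List[Optional[int]] = [None] * 19
for key in (5, 24, 43):               # all three keys have home address 5
    print(key, insert_rehash(table, key, first_hash, second_hash))
```

The three synonyms end up far apart because each uses its own displacement, whereas linear probing would have placed them in adjacent slots.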
37Rehashing(where P3) (2/2)
key:   A  S  E  A  R  C  H  I  N  G  E  X  A  M  P  L  E
hash:  1  0  5  1 18  3  8  9 14  7  5  5  1 13 16 12  5
[Figure: contents of the memory space (addresses 0 to 18) after each key of the insertion sequence is stored using rehashing.]
38Chained progressive overflow
39Overflow File
- When building the file, if a collision occurs, place the new synonym into a separate area of the file called the overflow section
40Scatter table
- An index that is searched by hashing
- The search of the index requires only one access
- A set of linked lists of synonyms
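A scatter table can be sketched as a hashed index of head pointers plus a record area in which synonyms are chained together. The class name, the record layout, and the hash function (sum of character codes mod the index size) are assumptions made for illustration.

```python
from typing import List, Optional, Tuple


class ScatterTable:
    """Hashed index of pointers; each entry heads a linked list of synonyms."""

    def __init__(self, num_addresses: int) -> None:
        self.index: List[int] = [-1] * num_addresses       # -1 = empty list
        self.records: List[Tuple[str, str, int]] = []      # (key, data, next)

    def _home(self, key: str) -> int:
        # Hypothetical hash function, used only for this sketch.
        return sum(ord(ch) for ch in key) % len(self.index)

    def insert(self, key: str, data: str) -> None:
        home = self._home(key)
        # The new record points at the old head and becomes the new head.
        self.records.append((key, data, self.index[home]))
        self.index[home] = len(self.records) - 1

    def search(self, key: str) -> Optional[str]:
        pointer = self.index[self._home(key)]   # one access to the index
        while pointer != -1:                    # then follow the synonym chain
            record_key, data, next_pointer = self.records[pointer]
            if record_key == key:
                return data
            pointer = next_pointer
        return None


table = ScatterTable(7)
table.insert("AB", "first record")
table.insert("BA", "second record")   # same character sum, so a synonym of AB
print(table.search("AB"), "|", table.search("BA"), "|", table.search("ZZ"))
```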