1
Hashing (Ch. 14)
  • Application: kalah end-game book
  • Operations: insert, find
  • Dictionary: insert, delete, search
  • Are O(log n) comparisons necessary? (no)
  • Hashing: the basic plan
  • create a big array for the items to be stored
  • use a function to figure out the storage
    location from the key (the hash function)
  • a collision resolution scheme is necessary

2
Hashing Example
  • Simple hash function:
  • Treat the key as a large integer K
  • h(K) = K mod M, where M is the table size
  • let M be a prime number
  • Example:
  • Suppose we have 101 buckets in the hash table.
  • abcd in ASCII hex is 0x61626364
  • Converted to decimal, it's 1633837924
  • 1633837924 mod 101 = 11
  • Thus h(abcd) = 11. Store the key at location
    11.
  • dcba hashes to 57.
  • abbc also hashes to 57: a collision. What to
    do?
  • If you have billions of possible keys and
    hundreds of buckets, lots of collisions are
    possible!
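
A minimal sketch of this computation (the name hash4 and the
big-endian packing of a 4-character ASCII key are illustrative,
matching the example above):

#include <stdio.h>

/* Pack a 4-character ASCII key into an integer (big-endian),
   then reduce it mod M. */
unsigned hash4(const char *key, unsigned M) {
    unsigned k = 0;
    for (int i = 0; i < 4; i++)
        k = (k << 8) | (unsigned char)key[i];
    return k % M;   /* e.g. "abcd" -> 0x61626364 = 1633837924 */
}

int main(void) {
    printf("%u\n", hash4("abcd", 101));   /* 11 */
    printf("%u\n", hash4("dcba", 101));   /* 57 */
    printf("%u\n", hash4("abbc", 101));   /* 57 -- a collision */
    return 0;
}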

3
Hashing Strings
  • h(aVeryLongVariableName)?
  • Instead of dealing with very large numbers, you
    can use Horner's method, reducing mod 101 at
    each step:
  • (256*97 + 86) mod 101 = 24918 mod 101 = 72   (a=97, V=86)
  • (256*72 + 101) mod 101 = 18533 mod 101 = 50  (e=101)
  • (256*50 + 114) mod 101 = 12914 mod 101 = 87  (r=114)
  • Scramble by replacing the multiplier 256 with 117

int hash(char *v, int M) {
    int h, a = 117;
    for (h = 0; *v; v++)
        h = (a * h + *v) % M;   /* Horner step: scale, add next char, reduce */
    return h;
}
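
For example, using the fixed-up function above:

char name[] = "aVeryLongVariableName";
int h = hash(name, 101);   /* h is in 0..100 */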
4
Collisions
  • How likely are collisions?
  • Birthday paradox
  • With table size M, expect the first collision
    after about sqrt(pi*M/2) insertions (about
    1.25*sqrt(M))
  • M = 100: about 12
  • M = 1000: about 40
  • M = 10000: about 125
  • 1.25*sqrt(365) is about 24
  • Experiment: generate random numbers 0..100
  • 84 35 45 32 89 1 58 16 38 69 5 90 16 16 53 61
  • Collision at 13th number, as predicted (see the
    simulation sketch below)
  • What to do about collisions?
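
A quick simulation of this experiment (a sketch; the trial count
and the uniform rand() % M draw are illustrative choices):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Draw uniform values in 0..M-1 until one repeats; return the
   draw count. The 'seen' array marks values already drawn. */
static int draws_until_collision(int M, char *seen) {
    memset(seen, 0, M);
    for (int count = 1; ; count++) {
        int r = rand() % M;
        if (seen[r]) return count;
        seen[r] = 1;
    }
}

int main(void) {
    int M = 100, trials = 10000;
    char *seen = malloc(M);
    long total = 0;
    for (int t = 0; t < trials; t++)
        total += draws_until_collision(M, seen);
    /* Expect about sqrt(pi*M/2) = 12.5 for M = 100 */
    printf("average draws to first collision: %.1f\n",
           (double)total / trials);
    free(seen);
    return 0;
}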

5
Separate Chaining
  • Build a linked list for each bucket
  • Linear search within list
  • (Figure: an 11-bucket table; bucket 1 chains
    L, A, A, A; bucket 2: M, X; bucket 3: N, C;
    bucket 5: E, P, E, E; bucket 7: G, R;
    bucket 8: H, S; bucket 9: I; others empty)
  • Simple, practical, widely used
  • Cuts search time by a factor of M over sequential
    search
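
A minimal separate-chaining sketch for integer keys (the names
insert/find and the integer-key simplification are illustrative,
not from the slides):

#include <stdio.h>
#include <stdlib.h>

#define M 101   /* number of buckets (prime) */

typedef struct node { int key; struct node *next; } node;

node *table[M];   /* one list head per bucket, all NULL initially */

/* Insert at the front of the key's bucket: O(1). */
void insert(int key) {
    node *n = malloc(sizeof *n);
    n->key = key;
    n->next = table[key % M];
    table[key % M] = n;
}

/* Linear search within the one bucket the key can be in. */
int find(int key) {
    for (node *p = table[key % M]; p != NULL; p = p->next)
        if (p->key == key) return 1;
    return 0;
}

int main(void) {
    insert(42); insert(42 + M);   /* both land in bucket 42 */
    printf("%d %d %d\n", find(42), find(42 + M), find(7));  /* 1 1 0 */
    return 0;
}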

6
Separate Chaining 2
  • Insertion time?
  • O(1)
  • Average search cost, successful search?
  • O(N/(2M))
  • Average search cost, unsuccessful?
  • O(N/M)
  • M large → constant average search time
  • Worst case O(N) (probabilistically unlikely)
  • Keep lists sorted?
  • insert time O(N/(2M))
  • unsuccessful search time O(N/(2M))

7
Linear Probing
  • Or, we could keep everything in the same table
  • Insert: upon collision, scan forward for a free
    spot
  • Search: same scan (if you reach a free spot,
    the search fails)
  • Runtime?
  • Still O(1) if the table is sparse
  • But as the table fills, clustering occurs
  • Skipping c spots (a fixed stride) doesn't help
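
A linear-probing sketch under the same assumptions (integer keys;
EMPTY = -1 marks a free slot; the table is never completely full):

#include <stdio.h>

#define M 101
#define EMPTY (-1)

int table[M];   /* every slot set to EMPTY before use */

/* Insert: start at the hash slot, scan forward (wrapping around)
   until a free spot is found. Assumes the table is not full. */
void insert(int key) {
    int i = key % M;
    while (table[i] != EMPTY)
        i = (i + 1) % M;
    table[i] = key;
}

/* Search: same scan; reaching a free slot means the key is absent. */
int find(int key) {
    for (int i = key % M; table[i] != EMPTY; i = (i + 1) % M)
        if (table[i] == key) return 1;
    return 0;
}

int main(void) {
    for (int i = 0; i < M; i++) table[i] = EMPTY;
    insert(5);
    insert(5 + M);               /* collides at slot 5, lands in slot 6 */
    printf("%d %d\n", find(5 + M), find(6));   /* 1 0 */
    return 0;
}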

8
Clustering
  • Long clusters tend to get longer
  • Precise analysis difficult
  • Theorem (Knuth):
  • Insert cost approx. (1 + 1/(1-N/M)^2)/2
  • (50% full → 2.5 probes; 80% full → 13 probes)
  • Search (hit) cost approx. (1 + 1/(1-N/M))/2
  • (50% full → 1.5 probes; 80% full → 3 probes)
  • Search (miss): same as insert
  • Too slow when the table gets 70-80% full (the
    formulas are tabulated in the sketch below)
  • How to reduce/avoid clustering?
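
A quick tabulation of Knuth's formulas reproduces the numbers in
parentheses (the helper names are illustrative):

#include <stdio.h>

/* Knuth's approximations for linear probing at load factor a = N/M. */
double insert_probes(double a) { return (1 + 1 / ((1 - a) * (1 - a))) / 2; }
double hit_probes(double a)    { return (1 + 1 / (1 - a)) / 2; }

int main(void) {
    printf("50%%: insert %.1f, hit %.1f\n", insert_probes(0.5), hit_probes(0.5));
    printf("80%%: insert %.1f, hit %.1f\n", insert_probes(0.8), hit_probes(0.8));
    return 0;   /* prints 2.5 / 1.5 and 13.0 / 3.0 */
}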

9
Double Hashing
  • Use a second hash function to compute the
    increment sequence
  • Analysis extremely difficult
  • About like the ideal (random probing)
  • Theorem (Guibas-Szemeredi):
  • Insert: approx. 1/(1-N/M)
  • Search hit: approx. ln(1/(1-N/M))/(N/M)
  • Search miss: same as insert
  • Not too slow until the table is about 90% full
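
A sketch of the probe sequence; the slides don't fix a second hash
function, so the common choice h2(k) = 1 + k mod (M-2) is assumed
here:

#include <stdio.h>

#define M 101
#define EMPTY (-1)

int table[M];

/* Second hash gives the step size; it must be nonzero, and with M
   prime every step size visits all M slots. */
int h2(int key) { return 1 + key % (M - 2); }

void insert(int key) {
    int i = key % M, step = h2(key);
    while (table[i] != EMPTY)
        i = (i + step) % M;        /* assumes the table is not full */
    table[i] = key;
}

int find(int key) {
    for (int i = key % M, step = h2(key); table[i] != EMPTY; i = (i + step) % M)
        if (table[i] == key) return 1;
    return 0;
}

int main(void) {
    for (int i = 0; i < M; i++) table[i] = EMPTY;
    insert(5); insert(5 + M);   /* same home slot, different step sizes */
    printf("%d %d\n", find(5), find(5 + M));   /* 1 1 */
    return 0;
}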

10
Dynamic Hash Tables
  • Suppose you are making a symbol table for a
    compiler. How big should you make the hash
    table?
  • If you don't know in advance how big a table to
    make, what to do?
  • Could grow the table when it fills (e.g. 50%
    full), as sketched after this list:
  • Make a new table of twice the size.
  • Make a new hash function
  • Re-hash all of the items in the new table
  • Dispose of the old table
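
A sketch of the four steps above, on a linear-probing table of
integers (growing by doubling; a real version would pick a prime
near 2M for the new size):

#include <stdio.h>
#include <stdlib.h>

#define EMPTY (-1)

int *table, M, N;        /* table, its size, the number of keys */

void insert(int key);    /* grow() re-inserts through insert() */

/* New table of twice the size, new hash (key % M with the new M),
   re-hash every item, dispose of the old table. */
void grow(void) {
    int *old = table, oldM = M;
    M = 2 * M;
    table = malloc(M * sizeof *table);
    for (int i = 0; i < M; i++) table[i] = EMPTY;
    N = 0;
    for (int i = 0; i < oldM; i++)
        if (old[i] != EMPTY) insert(old[i]);
    free(old);
}

/* Linear-probing insert that grows at 50% load. */
void insert(int key) {
    if (2 * (N + 1) > M) grow();
    int i = key % M;
    while (table[i] != EMPTY) i = (i + 1) % M;
    table[i] = key;
    N++;
}

int main(void) {
    M = 4; N = 0;
    table = malloc(M * sizeof *table);
    for (int i = 0; i < M; i++) table[i] = EMPTY;
    for (int k = 0; k < 100; k++) insert(3 * k);
    printf("M = %d after 100 inserts\n", M);   /* doubled several times */
    return 0;
}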

11
Table Growing Analysis
  • Worst case insertion: Θ(n), to re-hash all the
    items
  • Can we make any better statements?
  • Average case?
  • O(1) per insertion: insertions n through 2n cost
    O(n) (on average) for the insertions and O(2n)
    (on average) for the rehashing → O(n) total
    (with 3x the constant)
  • Amortized analysis?
  • The result above is actually an amortized result
    for the rehashing.
  • Any sequence of j insertions into an empty table
    has O(j) average cost for insertions and O(2j)
    for rehashing.
  • Or, think of it as billing 3 time units for each
    insertion, storing 2 in the bank. Withdraw them
    later for rehashing.

12
Separate Chaining vs. Double Hashing
  • Assume the same amount of space for keys and
    links (use pointers for long or variable-length
    keys)
  • Separate chaining:
  • 1M buckets, 4M keys
  • 4M links in the nodes
  • 9M words total; avg search time 2
  • Double hashing in the same space:
  • 4M items, 9M buckets in the table
  • average search time 1/(1 - 4/9) ≈ 1.8 (10% faster)
  • Double hashing in the same time:
  • 4M items, average search time 2
  • space needed: 8M words (1/(1 - 4/8) = 2), i.e.
    11% less space

13
Deletion
  • How to implement delete() with separate chaining?
  • Simply unlink unwanted item
  • Runtime?
  • Same as search()
  • How to implement delete() with linear probing?
  • Can't just erase it. (Why not? A later search
    probing through the erased slot would stop there
    and miss keys stored beyond it.)
  • Re-hash the entire cluster
  • Or mark the slot as deleted?
  • How to delete() with double hashing?
  • Re-hashing the cluster doesn't work: which
    cluster?
  • Mark as deleted
  • Every so often, re-hash the entire table to
    prune the dead wood
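
A sketch of the mark-as-deleted (tombstone) idea, shown with linear
probing for brevity; the same marking works for double hashing. The
name erase and the integer keys are illustrative:

#include <stdio.h>

#define M 101
#define EMPTY   (-1)
#define DELETED (-2)   /* the tombstone mark */

int table[M];

/* Search probes past tombstones; only a truly EMPTY slot ends the
   scan. Erasing to EMPTY instead would cut off keys stored beyond
   the erased slot -- the "why not" above. */
int find(int key) {
    for (int i = key % M; table[i] != EMPTY; i = (i + 1) % M)
        if (table[i] == key) return 1;
    return 0;
}

/* Delete: replace the key with a tombstone. */
void erase(int key) {
    for (int i = key % M; table[i] != EMPTY; i = (i + 1) % M)
        if (table[i] == key) { table[i] = DELETED; return; }
}

/* Insert may reclaim either an EMPTY slot or a tombstone. */
void insert(int key) {
    int i = key % M;
    while (table[i] != EMPTY && table[i] != DELETED)
        i = (i + 1) % M;
    table[i] = key;
}

int main(void) {
    for (int i = 0; i < M; i++) table[i] = EMPTY;
    insert(5); insert(5 + M);   /* 5+M probes past slot 5 into slot 6 */
    erase(5);
    printf("%d %d\n", find(5), find(5 + M));   /* 0 1 */
    return 0;
}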

14
Comparisons and summary
  • Separate chaining advantages
  • idiot-proof (degrades gracefully)
  • no large chunks of memory needed (but is this
    better?)
  • Why use hashing?
  • constant time search and insert, on average
  • easy to implement
  • Why not use hashing?
  • No performance guarantees
  • Uses extra space
  • Doesn't support pred, succ, sort, etc.: no
    notion of order
  • Where did Perl hashes get their name?

15
Hashing Summary
  • Separate chaining: easiest to deploy
  • Linear probing: fastest (but takes more memory)
  • Double hashing: least memory (but takes more
    time, to compute the second hash function)
  • Dynamic (grow): handles any number of inserts
  • Curious use of hashing: the early Unix spell
    checker (back in the days of the 3M machines)

Construction and search-miss times (RB = red-black tree, Chain =
separate chaining, Probe = linear probing, Dbl = double hashing,
Grow = dynamic growing):

          Construction                  Search miss
  N       RB   Chain Probe Dbl   Grow   RB   Chain Probe Dbl   Grow
  5k      6    1     4     4     3      2    1     0     1     0
  50k     74   18    11    12    22     36   15    8     8     8
  100k    182  35    21    23    47     84   45    23    21    15
  190k    -    79    106   59    155    144  -     2194  261   30
  200k    407  84    -     159   186    156  -     -     -     33

(- marks cells with no value in the source.)