1
Hashing (Ch. 14)
  • Application: kalah end-game book
  • Operations: insert, find
  • Dictionary: insert, delete, search
  • Are O(log n) comparisons necessary? (no)
  • Hashing: the basic plan
  • create a big array for the items to be stored
  • use a function to figure out the storage
    location from the key (the hash function)
  • a collision resolution scheme is necessary

2
Hashing Example
  • Simple hash function:
  • Treat the key as a large integer K
  • h(K) = K mod M, where M is the table size
  • let M be a prime number
  • Example:
  • Suppose we have 101 buckets in the hash table.
  • abcd in ASCII hex is 0x61626364
  • Converted to decimal, it's 1633837924
  • 1633837924 mod 101 = 11
  • Thus h(abcd) = 11. Store the key at location
    11.
  • dcba hashes to 57.
  • abbc also hashes to 57: a collision. What to
    do?
  • If you have billions of possible keys and
    hundreds of buckets, lots of collisions are
    possible!
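
A minimal sketch of this computation (the name hash4 and the
big-endian packing of a 4-character ASCII key are illustrative,
matching the example above):

#include <stdio.h>

/* Pack a 4-character ASCII key into an integer (big-endian),
   then reduce it mod M. */
unsigned hash4(const char *key, unsigned M) {
    unsigned k = 0;
    for (int i = 0; i < 4; i++)
        k = (k << 8) | (unsigned char)key[i];
    return k % M;   /* e.g. "abcd" -> 0x61626364 = 1633837924 */
}

int main(void) {
    printf("%u\n", hash4("abcd", 101));   /* 11 */
    printf("%u\n", hash4("dcba", 101));   /* 57 */
    printf("%u\n", hash4("abbc", 101));   /* 57 -- a collision */
    return 0;
}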

3
Hashing Strings
  • h(aVeryLongVariableName)?
  • Instead of dealing with very large numbers, you
    can use Horner's method, reducing mod 101 at
    each step:
  • (256*97 + 86) mod 101 = 24918 mod 101 = 72   (a=97, V=86)
  • (256*72 + 101) mod 101 = 18533 mod 101 = 50  (e=101)
  • (256*50 + 114) mod 101 = 12914 mod 101 = 87  (r=114)
  • Scramble by replacing the multiplier 256 with 117

int hash(char *v, int M) {
    int h, a = 117;
    for (h = 0; *v; v++)
        h = (a * h + *v) % M;   /* Horner step: scale, add next char, reduce */
    return h;
}
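
For example, using the fixed-up function above:

char name[] = "aVeryLongVariableName";
int h = hash(name, 101);   /* h is in 0..100 */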
4
Collisions
  • How likely are collisions?
  • Birthday paradox
  • With table size M, expect the first collision
    after about sqrt(pi*M/2) insertions (about
    1.25*sqrt(M))
  • M = 100: about 12
  • M = 1000: about 40
  • M = 10000: about 125
  • 1.25*sqrt(365) is about 24
  • Experiment: generate random numbers 0..100
  • 84 35 45 32 89 1 58 16 38 69 5 90 16 16 53 61
  • Collision at 13th number, as predicted (see the
    simulation sketch below)
  • What to do about collisions?
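
A quick simulation of this experiment (a sketch; the trial count
and the uniform rand() % M draw are illustrative choices):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Draw uniform values in 0..M-1 until one repeats; return the
   draw count. The 'seen' array marks values already drawn. */
static int draws_until_collision(int M, char *seen) {
    memset(seen, 0, M);
    for (int count = 1; ; count++) {
        int r = rand() % M;
        if (seen[r]) return count;
        seen[r] = 1;
    }
}

int main(void) {
    int M = 100, trials = 10000;
    char *seen = malloc(M);
    long total = 0;
    for (int t = 0; t < trials; t++)
        total += draws_until_collision(M, seen);
    /* Expect about sqrt(pi*M/2) = 12.5 for M = 100 */
    printf("average draws to first collision: %.1f\n",
           (double)total / trials);
    free(seen);
    return 0;
}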

5
Separate Chaining
  • Build a linked list for each bucket
  • Linear search within list
  • (Figure: an 11-bucket table; bucket 1 chains
    L, A, A, A; bucket 2: M, X; bucket 3: N, C;
    bucket 5: E, P, E, E; bucket 7: G, R;
    bucket 8: H, S; bucket 9: I; others empty)
  • Simple, practical, widely used
  • Cuts search time by a factor of M over sequential
    search
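
A minimal separate-chaining sketch for integer keys (the names
insert/find and the integer-key simplification are illustrative,
not from the slides):

#include <stdio.h>
#include <stdlib.h>

#define M 101   /* number of buckets (prime) */

typedef struct node { int key; struct node *next; } node;

node *table[M];   /* one list head per bucket, all NULL initially */

/* Insert at the front of the key's bucket: O(1). */
void insert(int key) {
    node *n = malloc(sizeof *n);
    n->key = key;
    n->next = table[key % M];
    table[key % M] = n;
}

/* Linear search within the one bucket the key can be in. */
int find(int key) {
    for (node *p = table[key % M]; p != NULL; p = p->next)
        if (p->key == key) return 1;
    return 0;
}

int main(void) {
    insert(42); insert(42 + M);   /* both land in bucket 42 */
    printf("%d %d %d\n", find(42), find(42 + M), find(7));  /* 1 1 0 */
    return 0;
}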

6
Separate Chaining 2
  • Insertion time?
  • O(1)
  • Average search cost, successful search?
  • O(N/(2M))
  • Average search cost, unsuccessful?
  • O(N/M)
  • M large → constant average search time
  • Worst case O(N) (probabilistically unlikely)
  • Keep lists sorted?
  • insert time O(N/(2M))
  • unsuccessful search time O(N/(2M))

7
Linear Probing
  • Or, we could keep everything in the same table
  • Insert: upon collision, scan forward for a free
    spot
  • Search: same scan (if you reach a free spot,
    the search fails)
  • Runtime?
  • Still O(1) if the table is sparse
  • But as the table fills, clustering occurs
  • Skipping c spots (a fixed stride) doesn't help
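
A linear-probing sketch under the same assumptions (integer keys;
EMPTY = -1 marks a free slot; the table is never completely full):

#include <stdio.h>

#define M 101
#define EMPTY (-1)

int table[M];   /* every slot set to EMPTY before use */

/* Insert: start at the hash slot, scan forward (wrapping around)
   until a free spot is found. Assumes the table is not full. */
void insert(int key) {
    int i = key % M;
    while (table[i] != EMPTY)
        i = (i + 1) % M;
    table[i] = key;
}

/* Search: same scan; reaching a free slot means the key is absent. */
int find(int key) {
    for (int i = key % M; table[i] != EMPTY; i = (i + 1) % M)
        if (table[i] == key) return 1;
    return 0;
}

int main(void) {
    for (int i = 0; i < M; i++) table[i] = EMPTY;
    insert(5);
    insert(5 + M);               /* collides at slot 5, lands in slot 6 */
    printf("%d %d\n", find(5 + M), find(6));   /* 1 0 */
    return 0;
}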

8
Clustering
  • Long clusters tend to get longer
  • Precise analysis difficult
  • Theorem (Knuth):
  • Insert cost approx. (1 + 1/(1-N/M)^2)/2
  • (50% full → 2.5 probes; 80% full → 13 probes)
  • Search (hit) cost approx. (1 + 1/(1-N/M))/2
  • (50% full → 1.5 probes; 80% full → 3 probes)
  • Search (miss): same as insert
  • Too slow when the table gets 70-80% full (the
    formulas are tabulated in the sketch below)
  • How to reduce/avoid clustering?
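
A quick tabulation of Knuth's formulas reproduces the numbers in
parentheses (the helper names are illustrative):

#include <stdio.h>

/* Knuth's approximations for linear probing at load factor a = N/M. */
double insert_probes(double a) { return (1 + 1 / ((1 - a) * (1 - a))) / 2; }
double hit_probes(double a)    { return (1 + 1 / (1 - a)) / 2; }

int main(void) {
    printf("50%%: insert %.1f, hit %.1f\n", insert_probes(0.5), hit_probes(0.5));
    printf("80%%: insert %.1f, hit %.1f\n", insert_probes(0.8), hit_probes(0.8));
    return 0;   /* prints 2.5 / 1.5 and 13.0 / 3.0 */
}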

9
Double Hashing
  • Use a second hash function to compute the
    increment sequence
  • Analysis extremely difficult
  • About like the ideal (random probing)
  • Theorem (Guibas-Szemeredi):
  • Insert: approx. 1/(1-N/M)
  • Search hit: approx. ln(1/(1-N/M))/(N/M)
  • Search miss: same as insert
  • Not too slow until the table is about 90% full
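
A sketch of the probe sequence; the slides don't fix a second hash
function, so the common choice h2(k) = 1 + k mod (M-2) is assumed
here:

#include <stdio.h>

#define M 101
#define EMPTY (-1)

int table[M];

/* Second hash gives the step size; it must be nonzero, and with M
   prime every step size visits all M slots. */
int h2(int key) { return 1 + key % (M - 2); }

void insert(int key) {
    int i = key % M, step = h2(key);
    while (table[i] != EMPTY)
        i = (i + step) % M;        /* assumes the table is not full */
    table[i] = key;
}

int find(int key) {
    for (int i = key % M, step = h2(key); table[i] != EMPTY; i = (i + step) % M)
        if (table[i] == key) return 1;
    return 0;
}

int main(void) {
    for (int i = 0; i < M; i++) table[i] = EMPTY;
    insert(5); insert(5 + M);   /* same home slot, different step sizes */
    printf("%d %d\n", find(5), find(5 + M));   /* 1 1 */
    return 0;
}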

10
Dynamic Hash Tables
  • Suppose you are making a symbol table for a
    compiler. How big should you make the hash
    table?
  • If you don't know in advance how big a table to
    make, what to do?
  • Could grow the table when it fills (e.g. 50%
    full), as sketched after this list:
  • Make a new table of twice the size.
  • Make a new hash function
  • Re-hash all of the items in the new table
  • Dispose of the old table
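
A sketch of the four steps above, on a linear-probing table of
integers (growing by doubling; a real version would pick a prime
near 2M for the new size):

#include <stdio.h>
#include <stdlib.h>

#define EMPTY (-1)

int *table, M, N;        /* table, its size, the number of keys */

void insert(int key);    /* grow() re-inserts through insert() */

/* New table of twice the size, new hash (key % M with the new M),
   re-hash every item, dispose of the old table. */
void grow(void) {
    int *old = table, oldM = M;
    M = 2 * M;
    table = malloc(M * sizeof *table);
    for (int i = 0; i < M; i++) table[i] = EMPTY;
    N = 0;
    for (int i = 0; i < oldM; i++)
        if (old[i] != EMPTY) insert(old[i]);
    free(old);
}

/* Linear-probing insert that grows at 50% load. */
void insert(int key) {
    if (2 * (N + 1) > M) grow();
    int i = key % M;
    while (table[i] != EMPTY) i = (i + 1) % M;
    table[i] = key;
    N++;
}

int main(void) {
    M = 4; N = 0;
    table = malloc(M * sizeof *table);
    for (int i = 0; i < M; i++) table[i] = EMPTY;
    for (int k = 0; k < 100; k++) insert(3 * k);
    printf("M = %d after 100 inserts\n", M);   /* doubled several times */
    return 0;
}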

11
Table Growing Analysis
  • Worst case insertion: Θ(n), to re-hash all the
    items
  • Can we make any better statements?
  • Average case?
  • O(1) per insertion: insertions n through 2n cost
    O(n) (on average) for the insertions and O(2n)
    (on average) for the rehashing → O(n) total
    (with 3x the constant)
  • Amortized analysis?
  • The result above is actually an amortized result
    for the rehashing.
  • Any sequence of j insertions into an empty table
    has O(j) average cost for insertions and O(2j)
    for rehashing.
  • Or, think of it as billing 3 time units for each
    insertion, storing 2 in the bank. Withdraw them
    later for rehashing.

12
Separate Chaining vs. Double Hashing
  • Assume the same amount of space for keys and
    links (use pointers for long or variable-length
    keys)
  • Separate chaining:
  • 1M buckets, 4M keys
  • 4M links in the nodes
  • 9M words total; avg search time 2
  • Double hashing in the same space:
  • 4M items, 9M buckets in the table
  • average search time 1/(1 - 4/9) ≈ 1.8 (10% faster)
  • Double hashing in the same time:
  • 4M items, average search time 2
  • space needed: 8M words (1/(1 - 4/8) = 2), i.e.
    11% less space

13
Deletion
  • How to implement delete() with separate chaining?
  • Simply unlink unwanted item
  • Runtime?
  • Same as search()
  • How to implement delete() with linear probing?
  • Can't just erase it. (Why not? A later search
    probing through the erased slot would stop there
    and miss keys stored beyond it.)
  • Re-hash the entire cluster
  • Or mark the slot as deleted?
  • How to delete() with double hashing?
  • Re-hashing the cluster doesn't work: which
    cluster?
  • Mark as deleted
  • Every so often, re-hash the entire table to
    prune the dead wood
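
A sketch of the mark-as-deleted (tombstone) idea, shown with linear
probing for brevity; the same marking works for double hashing. The
name erase and the integer keys are illustrative:

#include <stdio.h>

#define M 101
#define EMPTY   (-1)
#define DELETED (-2)   /* the tombstone mark */

int table[M];

/* Search probes past tombstones; only a truly EMPTY slot ends the
   scan. Erasing to EMPTY instead would cut off keys stored beyond
   the erased slot -- the "why not" above. */
int find(int key) {
    for (int i = key % M; table[i] != EMPTY; i = (i + 1) % M)
        if (table[i] == key) return 1;
    return 0;
}

/* Delete: replace the key with a tombstone. */
void erase(int key) {
    for (int i = key % M; table[i] != EMPTY; i = (i + 1) % M)
        if (table[i] == key) { table[i] = DELETED; return; }
}

/* Insert may reclaim either an EMPTY slot or a tombstone. */
void insert(int key) {
    int i = key % M;
    while (table[i] != EMPTY && table[i] != DELETED)
        i = (i + 1) % M;
    table[i] = key;
}

int main(void) {
    for (int i = 0; i < M; i++) table[i] = EMPTY;
    insert(5); insert(5 + M);   /* 5+M probes past slot 5 into slot 6 */
    erase(5);
    printf("%d %d\n", find(5), find(5 + M));   /* 0 1 */
    return 0;
}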

14
Comparisons and summary
  • Separate chaining advantages
  • idiot-proof (degrades gracefully)
  • no large chunks of memory needed (but is this
    better?)
  • Why use hashing?
  • constant time search and insert, on average
  • easy to implement
  • Why not use hashing?
  • No performance guarantees
  • Uses extra space
  • Doesn't support pred, succ, sort, etc.: no
    notion of order
  • Where did Perl hashes get their name?

15
Hashing Summary
  • Separate chaining: easiest to deploy
  • Linear probing: fastest (but takes more memory)
  • Double hashing: least memory (but takes more
    time, to compute the second hash function)
  • Dynamic (grow): handles any number of inserts
  • Curious use of hashing: the early Unix spell
    checker (back in the days of the 3M machines)

Construction and search-miss times (RB = red-black tree, Chain =
separate chaining, Probe = linear probing, Dbl = double hashing,
Grow = dynamic growing):

          Construction                  Search miss
  N       RB   Chain Probe Dbl   Grow   RB   Chain Probe Dbl   Grow
  5k      6    1     4     4     3      2    1     0     1     0
  50k     74   18    11    12    22     36   15    8     8     8
  100k    182  35    21    23    47     84   45    23    21    15
  190k    -    79    106   59    155    144  -     2194  261   30
  200k    407  84    -     159   186    156  -     -     -     33

(- marks cells with no value in the source.)