Title:

Hashing

Slides: 40

Provided by: ngi52

Learn more at: http://www.cs.bsu.edu

1
Hashing

8 April 2003

2
Example

Consider a situation where we want to make a list
of records for students currently doing the BSU
CS degree, with each student uniquely identified
by a student number.
The student numbers currently range from about
1,000,000 to above 9,999,999, so an array
of 10 million elements would be enough to hold
all possible student numbers. Given that each
student record is at least 100 bytes long, we would
require an array of 1,000 megabytes to do
this.

3
Example - 2

There are fewer than 400 students enrolled in CS
at present.
There must be a better way.
We could have a sorted array of 400 elements and
retrieve students using a binary search.
We want our access to be as fast as possible. In
this situation we would use a hash table.

4
Example - 3

Find some way to transform a student number from
the several million possible values to a range
closer to 400, while avoiding (as much as possible)
the case where two numbers transform (or hash) to
the same value.
We place the records according to their
transformed key into a new array (or hash table)
containing at least 400 elements.

5
Example - 4

Make the size of the hash table 479 elements
long.
A popular method for transforming keys is to use
the mod operator (take the remainder upon integer
division of the original key by the size of the
hash table)

6
Example - 5

For example, consider student number 949,786,456:
949786456 mod 479 = 348
Therefore we should place this student in array
element 348 in the hash table (note that the mod
operator is effective because its result can only
take values in the range 0 - 478).
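The calculation on this slide can be checked with a minimal Python sketch. The function name hash_student_number is an assumption made for illustration; only the table size 479 and the student number come from the slide.

```python
TABLE_SIZE = 479  # table size chosen on the slide

def hash_student_number(student_number, table_size=TABLE_SIZE):
    """Map a student number to a slot index using the mod operator."""
    # The remainder is always in the range 0 .. table_size - 1.
    return student_number % table_size

print(hash_student_number(949786456))  # 348
```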

7
Direct Access Table

If we have a collection of n elements whose keys
are unique integers in (1,m), where m >= n, then
we can store the items in a direct address table,
T[m], where T[i] is either empty or contains one of
the n elements. Searching a direct address table
is an O(1) operation:
for a key, k, we access T[k],
if it contains an element, return it,
if it doesn't, then return NULL.
There are two constraints:
the keys must be unique, and
the range of the keys must be severely bounded.
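A direct address table of this kind might be sketched as follows. The class and method names are assumptions for illustration, and Python's None stands in for NULL.

```python
class DirectAddressTable:
    """Direct address table for unique integer keys in the range 1..m."""

    def __init__(self, m):
        # Index 0 is left unused so that key k lives at index k.
        self.slots = [None] * (m + 1)

    def insert(self, key, element):
        self.slots[key] = element

    def search(self, key):
        # O(1): a single array access; None plays the role of NULL.
        return self.slots[key]

table = DirectAddressTable(10)
table.insert(7, "record for key 7")
print(table.search(7))  # record for key 7
print(table.search(3))  # None
```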

8
Direct Access Table
9
Using Linked Lists

If the keys are not unique, then we can construct
a set of m lists and store the heads of these
lists in the direct address table.
The time to find the list for a given key will
still be O(1).
If the maximum number of duplicates is ndupmax,
then searching that list for a specific element is
O(ndupmax).
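A rough sketch of this idea follows. It uses Python lists in place of hand-built linked lists, and the names DirectAddressListTable, find_all and find_specific are assumptions for illustration.

```python
class DirectAddressListTable:
    """Direct address table whose slots hold lists of elements sharing a key."""

    def __init__(self, m):
        # One (initially empty) list per key in 1..m; index 0 is unused.
        self.heads = [[] for _ in range(m + 1)]

    def insert(self, key, element):
        self.heads[key].append(element)

    def find_all(self, key):
        # Reaching the list for a key is a single array access: O(1).
        return self.heads[key]

    def find_specific(self, key, matches):
        # Scanning the duplicates costs at most O(ndupmax).
        for element in self.heads[key]:
            if matches(element):
                return element
        return None

table = DirectAddressListTable(5)
table.insert(3, "first")
table.insert(3, "second")
print(table.find_all(3))                                # ['first', 'second']
print(table.find_specific(3, lambda e: e == "second"))  # second
```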

10
Using Linked Lists

If duplicates are the exception rather than the
rule, then ndupmax is much smaller than n and a
direct address table will provide good
performance.
But if ndupmax approaches n, then the time to
find a specific element approaches O(n) and some
other structure such as a tree will be more
efficient.

11
Using Linked Lists
12
Analysis

The range of the keys determines the size of the
direct address table and may be too large to be
practical.
For instance, it's not likely that you'll be able
to use a direct address table to store elements
which have arbitrary 32-bit integers as their
keys for a few years yet!
Direct addressing is easily generalized to the
case where there is a function, h(k) -> (1,m),
which maps each value of the key, k, to the range
(1,m). In this case, we place the element in
T[h(k)] rather than T[k] and we can search in
O(1) time as before.

13
Mapping Functions

The direct address approach requires that the
function, h(k), is a one-to-one mapping from each
k to integers in (1,m). Such a function is known
as a perfect hashing function: it maps each key
to a distinct integer within some manageable
range and lets us build an O(1) search time
table.
Finding a perfect hashing function is not always
possible.
Sometimes we can find a hash function which maps
most of the keys onto unique integers, but maps a
small number of keys onto the same integer.
If the number of collisions is sufficiently
small, then hash tables work well and give O(1)
search times.

14
Handling Collisions

When multiple keys map to the same integer,
elements with different keys must be stored in
the same slot of the hash table, so there may be
more than one element which should be stored in a
single slot of the table.
Techniques used to manage this problem are:
chaining
overflow areas
re-hashing
using neighboring slots (linear probing)
quadratic probing
random probing

15
Chaining

One simple scheme is to chain all collisions in
lists attached to the appropriate slot.
This allows an unlimited number of collisions to
be handled and doesn't require a priori knowledge
of how many elements are in the collection.
The tradeoff is the same as with linked lists
versus array implementations of sets: linked
lists incur overhead in space and, to a lesser
extent, in time.

16
Chaining
17
How Chaining Works

To insert a new item in the table, we hash the
key to determine
which list the item goes on, and then
insert the item at the beginning of the list. (For
example, to insert 11, we divide 11 by 8 giving a
remainder of 3. Thus, 11 goes on the list
starting at HashTable[3].)
To find an item, we hash the number and then
follow links in the chain down the list to see if
it is present.

18
How Chaining Works-2

To delete a number, we find the number and remove
the node from the appropriate linked list.
Entries in the hash table are dynamically
allocated and entered on a linked list associated
with each hash table entry.
Alternative methods, where all entries are stored
in the hash table itself, are known as direct or
open addressing.
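The insert, find and delete steps for chaining can be sketched as below. This is a minimal illustration, not the presenter's code: the class names Node and ChainedHashTable are assumptions, and the table size 8 is taken from the worked example of inserting 11.

```python
class Node:
    """One dynamically allocated entry on a chain."""

    def __init__(self, key, next_node=None):
        self.key = key
        self.next = next_node

class ChainedHashTable:
    def __init__(self, size=8):
        self.size = size
        self.table = [None] * size  # each entry is the head of a chain

    def _hash(self, key):
        return key % self.size

    def insert(self, key):
        index = self._hash(key)
        # New items go at the beginning of the list.
        self.table[index] = Node(key, self.table[index])

    def find(self, key):
        node = self.table[self._hash(key)]
        while node is not None:  # follow links down the chain
            if node.key == key:
                return True
            node = node.next
        return False

    def delete(self, key):
        index = self._hash(key)
        prev, node = None, self.table[index]
        while node is not None:
            if node.key == key:
                if prev is None:
                    self.table[index] = node.next  # unlink the head
                else:
                    prev.next = node.next          # unlink an inner node
                return True
            prev, node = node, node.next
        return False

table = ChainedHashTable(8)
table.insert(11)                # 11 mod 8 = 3, so 11 goes on chain 3
print(table.table[3].key)       # 11
print(table.find(11))           # True
```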

19
Re-hashing

Re-hashing schemes use a second hashing operation
when there is a collision. If there is a further
collision, we re-hash until an empty slot in
the table is found.
The re-hashing function can either be a new
function or a re-application of the original one.
As long as the functions are applied to a key in
the same order, then a sought key can always be
found.

20
Re-Hashing
21
Linear probing

One of the simplest re-hashing functions is +1
(or -1), i.e., on a collision, look in the
neighboring slot in the table.
It calculates the new address extremely quickly.

22
Open Addressing

1. Linear Probing
In linear probing, when a collision occurs, the
new element is put in the next available spot
(essentially doing a sequential search).
Example:
Insert 49, 18, 89, 48
Hash table size is 10, so
49 mod 10 = 9,
18 mod 10 = 8,
89 mod 10 = 9,
48 mod 10 = 8
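The insertions in this example can be traced with a short sketch. The function name linear_probe_insert is an assumption, and so is the choice to wrap around from the last slot to slot 0, which the slide text does not spell out.

```python
def linear_probe_insert(table, key):
    """Insert key using linear probing; return the slot it ends up in."""
    size = len(table)
    index = key % size
    for _ in range(size):
        if table[index] is None:
            table[index] = key
            return index
        # Collision: try the next slot, wrapping around at the end
        # (the wrap-around is an assumption of this sketch).
        index = (index + 1) % size
    raise RuntimeError("hash table is full")

table = [None] * 10
for key in (49, 18, 89, 48):
    linear_probe_insert(table, key)

print(table)  # [89, 48, None, None, None, None, None, None, 18, 49]
```

Here 89 collides with 49 in slot 9 and wraps to slot 0, while 48 collides in slots 8, 9 and 0 before landing in slot 1.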

23
Open Addressing
24
Problems

In linear probing, records tend to cluster around
each other. (Once an element is placed in the
hash table, the chances of its adjacent element
being filled are doubled: it can be filled either
by a collision or directly.)
If two adjacent elements are filled, then the
chance of the next element being filled is three
times that for an element with no neighbor.

25
Animation from the Web

The animation gives you a practical demonstration
of the effect of linear probing; it also
implements a quadratic re-hash function so that
you can see the differences.
http://ciips.ee.uwa.edu.au/morris/Year2/PLDS210/hash_tables.html

26
Clustering

Linear probing is subject to a clustering
phenomenon.
Re-hashes from one location occupy a block of
slots in the table which grows towards slots
and blocks to which other keys hash.
This exacerbates the collision problem and the
number of re-hashes can become large.

27
Quadratic Probing

Better behavior is usually obtained with
quadratic probing, where the secondary hash
function depends on the re-hash index:
address = h(key) + c i^2
on the i-th re-hash. (A more complex function of i
can be used.)
Quadratic probing is susceptible to secondary
clustering, since keys which have the same hash
value also have the same probe sequence.
Secondary clustering is not nearly as severe as
the clustering caused by linear probing.
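The probe sequence and the secondary clustering effect can be illustrated with a small sketch. It assumes h(key) = key mod table size and reduces every address modulo the table size; the function name and the default c = 1 are also assumptions.

```python
def quadratic_probe_sequence(key, table_size, c=1, max_probes=None):
    """Return the slots tried for key: (h(key) + c * i^2) mod table_size."""
    if max_probes is None:
        max_probes = table_size
    home = key % table_size  # assumed primary hash function
    return [(home + c * i * i) % table_size for i in range(max_probes)]

# 89 and 49 both hash to slot 9, so they share one probe sequence
# (secondary clustering).
print(quadratic_probe_sequence(89, 10, max_probes=4))  # [9, 0, 3, 8]
print(quadratic_probe_sequence(49, 10, max_probes=4))  # [9, 0, 3, 8]
```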

28
Overflow area

When a collision occurs, a slot in an overflow
area is used for the new element and a link from
the primary slot established as in a chained
system.
This is essentially the same as chaining, except
that the overflow area is pre-allocated and thus
may be faster to access.
As with re-hashing, the maximum number of
elements must be known in advance, but in this
case, two parameters must be estimated: the
optimum size of the primary and overflow areas.
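One possible layout for an overflow area is sketched below. Each slot stores a [key, link] pair, where the link is an index into the pre-allocated overflow array. The class name, the slot layout and the mod hash function are all assumptions made for illustration.

```python
class OverflowHashTable:
    def __init__(self, primary_size, overflow_size):
        self.primary_size = primary_size
        # Both areas are allocated up front; a slot is [key, link] or None.
        self.primary = [None] * primary_size
        self.overflow = [None] * overflow_size
        self.next_free = 0  # next unused slot in the overflow area

    def insert(self, key):
        index = key % self.primary_size
        if self.primary[index] is None:
            self.primary[index] = [key, None]
            return
        if self.next_free >= len(self.overflow):
            raise RuntimeError("overflow area is full")
        # Collision: walk to the end of the chain, then link a new
        # overflow slot onto it.
        slot = self.primary[index]
        while slot[1] is not None:
            slot = self.overflow[slot[1]]
        self.overflow[self.next_free] = [key, None]
        slot[1] = self.next_free
        self.next_free += 1

    def find(self, key):
        slot = self.primary[key % self.primary_size]
        while slot is not None:
            if slot[0] == key:
                return True
            slot = self.overflow[slot[1]] if slot[1] is not None else None
        return False

table = OverflowHashTable(primary_size=10, overflow_size=4)
for key in (49, 18, 89, 48):
    table.insert(key)
print(table.primary[9])   # [49, 0]
print(table.overflow[0])  # [89, None]
```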

29
Overflow Area
30
Comparison
31
Hash Functions

If the hash function is uniform (equally
distributes the data keys among the hash table
indices), then hashing effectively subdivides the
list to be searched.
Worst-case behavior occurs when all keys hash to
the same index. Why?
It is important to choose a good hash function.
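One way to see the difference between a uniform hash function and the worst case is to count how many keys land on each index. The helper name chain_lengths and the two example hash functions are assumptions for illustration.

```python
def chain_lengths(keys, table_size, hash_function):
    """Count how many keys land on each index of the table."""
    lengths = [0] * table_size
    for key in keys:
        lengths[hash_function(key) % table_size] += 1
    return lengths

keys = range(100)
uniform = chain_lengths(keys, 10, lambda k: k)     # spreads keys evenly
degenerate = chain_lengths(keys, 10, lambda k: 0)  # every key hashes to 0

print(max(uniform))     # 10
print(max(degenerate))  # 100
```

With the uniform function a search examines at most 10 of the 100 keys; with the degenerate one it may have to examine all 100, which is no better than searching a plain list.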

32
Choosing Hash Functions

Choice of h: h(x)
must be simple
must distribute (spread) the data evenly
Choice of m: m approximates n (about 1
item per linked list), where n = input size

33
Mod Function

Choice of a three-digit hash for phone numbers,
e.g. 398-3738
x is an integer value. h(x) = x mod m.
Choosing the last three digits (738) is more
appropriate than the first three digits (398) as
it distributes the data more evenly.
To do this, use the mod function
x mod m:
h(x) = x mod 10^k gives the last k digits
h(x) = x mod 2^k gives the last k bits
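Both forms of the mod function can be tried out with a short sketch. The function names are assumptions, and the phone number is the one from the slide written without its hyphen.

```python
def last_k_digits(x, k):
    """h(x) = x mod 10^k keeps the last k decimal digits."""
    return x % 10**k

def last_k_bits(x, k):
    """h(x) = x mod 2^k keeps the last k binary digits."""
    return x % 2**k

print(last_k_digits(3983738, 3))  # 738
print(last_k_bits(0b101101, 3))   # 5, i.e. binary 101
```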

34
Middle Digits of an Integer

This often yields unpredictable (and thus good)
distributions of the data.
Assume that you wish to take the two digits three
positions from the right of x.
If x = 539872178, then h(x) = 72.
This is obtained by h(x) = (x/1000) mod 100,
where (x/1000), using integer division, drops three
digits and (x/1000) mod 100 keeps two digits.
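The middle-digits calculation can be written as a small sketch. The function name and its parameters drop and keep are assumptions; the defaults reproduce the example on the slide.

```python
def middle_digits(x, drop=3, keep=2):
    """Drop the last `drop` digits of x, then keep the next `keep` digits."""
    # Integer division discards digits on the right; mod keeps the rest.
    return (x // 10**drop) % 10**keep

print(middle_digits(539872178))  # 72
```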

35
Order Preserving Hash Function

x < y implies h(x) < h(y)
Application: sorting

36
Perfect Hashing Function

A perfect hashing function is one that causes no
collisions.
Perfect hashing functions can be found only under
certain conditions.
One application of the perfect hash function is a
static dictionary.
h(x) is designed after having peeked at the data.

37
Retrieval

Retrieving a record works the same way as
insertion.
Take the key value, perform the same
transformation as for insertion, then look up the
value in the hash table.

38
Issues

There are two basic issues when designing a hash
algorithm:
Choosing the best hash function
Deciding what to do with collisions

39
Hash Function Strategies

If the key is an integer and there is no reason
to expect a non-random key distribution, then the
modulus operator is a simple, efficient and
effective method.
If the key is a string value (e.g. someone's name
or C reserved words), then it first needs to be
transformed to an integer.
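The slides do not say how a string should be turned into an integer. One common approach, shown here purely as an assumption, is to combine the character codes as the digits of a number in some base and then apply the modulus operator. The function names and the base 31 are illustrative choices.

```python
def string_to_int(text, base=31):
    """Combine the character codes of text into a single integer."""
    value = 0
    for ch in text:
        # Treat each character code as the next digit in the given base.
        value = value * base + ord(ch)
    return value

def hash_string(text, table_size):
    """Transform the string to an integer, then take it mod the table size."""
    return string_to_int(text) % table_size

print(string_to_int("ab") == 97 * 31 + 98)  # True
print(0 <= hash_string("while", 479) < 479)  # True
```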
