Hashing - PowerPoint PPT Presentation

Provided by: ngi52
Learn more at: http://www.cs.bsu.edu
Transcript and Presenter's Notes

Title: Hashing


1
Hashing
  • 8 April 2003

2
Example
  • Consider a situation where we want to make a list
    of records for students currently doing the BSU
    CS degree, with each student uniquely identified
    by a student number.
  • The student numbers currently range from about
    1,000,000 up to 9,999,999, so an array of 10
    million elements would be enough to hold all
    possible student numbers. Given that each student
    record is at least 100 bytes long, we would
    require an array of about 1,000 megabytes to do
    this.

3
Example - 2
  • There are fewer than 400 students enrolled in CS
    at present
  • There must be a better way
  • We could have a sorted array of 400 elements and
    retrieve students using a binary search.
  • We want our access to be as fast as possible. In
    this situation we would use a hash table.
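For comparison, the sorted-array alternative mentioned above can be sketched as follows. This is only an illustration: the student numbers, the record values, and the helper name find_student are made up, and Python's standard bisect module stands in for a hand-written binary search.

```python
from bisect import bisect_left

def find_student(sorted_numbers, records, student_number):
    """Binary search: O(log n) comparisons on a sorted list of keys."""
    i = bisect_left(sorted_numbers, student_number)
    if i < len(sorted_numbers) and sorted_numbers[i] == student_number:
        return records[i]
    return None  # student number not present

# Hypothetical data, kept sorted by student number.
sorted_numbers = [1000123, 2456789, 9876543]
records = ["record A", "record B", "record C"]

print(find_student(sorted_numbers, records, 2456789))  # record B
print(find_student(sorted_numbers, records, 5555555))  # None
```

With 400 students this needs about 9 comparisons per lookup, which is the cost a hash table tries to reduce further.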

4
Example - 3
  • Find some way to transform a student number from
    the several million values to a range closer to
    400 but avoiding (as much as possible) the case
    where two numbers transform (or hash) to the same
    value.
  • We place the records according to their
    transformed key into a new array (or hash table)
    containing at least 400 elements.

5
Example - 4
  • Make the size of the hash table 479 elements
    long.
  • A popular method for transforming keys is to use
    the mod operator (take the remainder upon integer
    division of the original key by the size of the
    hash table)

6
Example - 5
  • For example, consider student number 949,786,456.
  • 949786456 mod 479 = 348
  • Therefore we should place this student in array
    element 348 in the hash table. (Note that the mod
    operator is effective because its result can only
    take values in the range 0 to 478, which matches
    the indices of the table.)
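The calculation above can be checked with a few lines of Python. The function name hash_student_number is an illustrative choice, not something from the slides; the table size 479 and the student number are the ones used in the example.

```python
TABLE_SIZE = 479  # hash table size chosen on the earlier slide

def hash_student_number(student_number):
    """Map a student number to a slot index in the range 0..TABLE_SIZE-1."""
    return student_number % TABLE_SIZE

print(hash_student_number(949786456))  # 348
```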

7
Direct Access Table
  • If we have a collection of n elements whose keys
    are unique integers in (1,m), where m > n, then
    we can store the items in a direct address table,
    T[m], where T[i] is either empty or contains one
    of the n elements. Searching a direct address
    table is an O(1) operation:
  • for a key, k, we access T[k],
  • if it contains an element, return it,
  • if it doesn't, then return NULL.
  • There are two constraints:
  • the keys must be unique, and
  • the range of the keys must be severely bounded.
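A minimal sketch of a direct address table, assuming integer keys in the range 1..m. The class and method names are illustrative, and Python's None plays the role of NULL.

```python
class DirectAddressTable:
    """Direct address table for unique integer keys in 1..m."""

    def __init__(self, m):
        self.m = m
        # Index 0 is left unused so that key k is stored at index k.
        self.slots = [None] * (m + 1)

    def insert(self, key, element):
        if not 1 <= key <= self.m:
            raise KeyError(f"key {key} is outside 1..{self.m}")
        self.slots[key] = element

    def search(self, key):
        # O(1): a single index operation, no comparisons with other keys.
        if not 1 <= key <= self.m:
            return None
        return self.slots[key]

table = DirectAddressTable(m=20)
table.insert(7, "element with key 7")
print(table.search(7))   # element with key 7
print(table.search(8))   # None
```

The cost of this simplicity is the second constraint: the list has m + 1 slots no matter how few elements are stored.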

8
Direct Access Table

9
Using Linked Lists
  • If the keys are not unique, then we can construct
    a set of m lists and store the heads of these
    lists in the direct address table.
  • The time to find some element matching a given
    key will still be O(1).
  • If the maximum number of duplicates is ndupmax,
    then searching for a specific element is
    O(ndupmax).

10
Using Linked Lists
  • If duplicates are the exception rather than the
    rule, then ndupmax is much smaller than n and a
    direct address table will provide good
    performance.
  • But if ndupmax approaches n, then the time to
    find a specific element approaches O(n) and some
    other structure such as a tree will be more
    efficient.

11
Using Linked Lists

12
Analysis
  • The range of the keys determines the size of the
    direct address table and may be too large to be
    practical.
  • For instance, it's not likely that you'll be able
    to use a direct address table to store elements
    which have arbitrary 32-bit integers as their
    keys for a few years yet!
  • Direct addressing is easily generalized to the
    case where there is a function, h(k) -> (1,m),
    which maps each value of the key, k, to the range
    (1,m). In this case, we place the element in
    T[h(k)] rather than T[k] and we can search in
    O(1) time as before.

13
Mapping Functions
  • The direct address approach requires that the
    function, h(k), is a one-to-one mapping from each
    k to integers in (1,m). Such a function is known
    as a perfect hashing function: it maps each key
    to a distinct integer within some manageable
    range and lets us build an O(1) search time
    table.
  • Finding a perfect hashing function is not always
    possible.
  • Sometimes we can find a hash function which maps
    most of the keys onto unique integers, but maps a
    small number of keys onto the same integer.
  • If the number of collisions is sufficiently
    small, then hash tables work well and give O(1)
    search times.

14
Handling Collisions
  • When multiple keys map to the same integer,
    elements with different keys may need to be
    stored in the same slot of the hash table.
  • There may be more than one element which should
    be stored in a single slot of the table.
  • Techniques used to manage this problem are:
  • chaining
  • overflow areas
  • re-hashing
  • using neighboring slots (linear probing)
  • quadratic probing
  • random probing

15
Chaining
  • One simple scheme is to chain all collisions in
    lists attached to the appropriate slot.
  • Allows an unlimited number of collisions to be
    handled and doesn't require a priori knowledge
    of how many elements the collection will contain.
  • The tradeoff is the same as with linked lists
    versus array implementations of sets: linked
    lists incur overhead in space and, to a lesser
    extent, in time.

16
Chaining

17
How Chaining Works
  • To insert a new item in the table, we hash the
    key to determine which list the item goes on,
    then insert the item at the beginning of that
    list. (For example, with a table of 8 slots, to
    insert 11 we divide 11 by 8, giving a remainder
    of 3. Thus, 11 goes on the list starting at
    HashTable[3].)
  • To find an item, we hash the number and then
    follow links in the chain down the list to see if
    it is present.

18
How Chaining Works-2
  • To delete a number, we find the number and remove
    the node from the appropriate linked list.
  • Entries in the hash table are dynamically
    allocated and entered on a linked list associated
    with each hash table entry.
  • Alternative methods, where all entries are stored
    in the hash table itself, are known as direct or
    open addressing.
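The insert, find, and delete steps described for chaining can be sketched as below. This is an illustrative sketch rather than the presenter's code: it assumes integer keys, a table of 8 slots as in the example with key 11, and the class names Node and ChainedHashTable are made up.

```python
class Node:
    """One dynamically allocated entry on a chain."""
    def __init__(self, key, next_node=None):
        self.key = key
        self.next = next_node

class ChainedHashTable:
    def __init__(self, size=8):
        self.size = size
        self.heads = [None] * size  # one list head per table slot

    def _hash(self, key):
        return key % self.size

    def insert(self, key):
        i = self._hash(key)
        # The new node goes at the beginning of the list for slot i.
        self.heads[i] = Node(key, self.heads[i])

    def find(self, key):
        node = self.heads[self._hash(key)]
        while node is not None:  # follow the links down the chain
            if node.key == key:
                return True
            node = node.next
        return False

    def delete(self, key):
        i = self._hash(key)
        prev, node = None, self.heads[i]
        while node is not None:
            if node.key == key:
                if prev is None:
                    self.heads[i] = node.next  # removing the head node
                else:
                    prev.next = node.next      # unlink an interior node
                return True
            prev, node = node, node.next
        return False

table = ChainedHashTable(size=8)
for key in (11, 19, 3):   # all three keys leave remainder 3 when divided by 8
    table.insert(key)
print(table.find(19))     # True
table.delete(19)
print(table.find(19))     # False
```

Because all three keys hash to slot 3, they end up on one chain, which is exactly the collision case chaining is meant to handle.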

19
Re-hashing
  • Re-hashing schemes use a second hashing operation
    when there is a collision. If there is a further
    collision, we re-hash until an empty slot in
    the table is found.
  • The re-hashing function can either be a new
    function or a re-application of the original one.
    As long as the functions are applied to a key in
    the same order, then a sought key can always be
    found.

20
Re-Hashing

21
Linear probing
  • One of the simplest re-hashing functions is +1
    (or -1), i.e., on a collision, look in the
    neighboring slot in the table.
  • It calculates the new address extremely quickly.

22
Open Addressing
  • 1. Linear Probing
  • In linear probing, when a collision occurs, the
    new element is put in the next available spot
    (essentially doing a sequential search).
  • Example:
  • Insert 49, 18, 89, 48
  • Hash table size is 10, so 49 mod 10 = 9,
  • 18 mod 10 = 8,
  • 89 mod 10 = 9,
  • 48 mod 10 = 8
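The example above (inserting 49, 18, 89, 48 into a table of size 10) can be traced with a short sketch. It assumes that probing moves forward one slot at a time and wraps around from the last slot to slot 0; the function name is illustrative.

```python
def linear_probe_insert(table, key):
    """Insert key using linear probing; return the slot that was used."""
    size = len(table)
    home = key % size
    for i in range(size):
        slot = (home + i) % size  # step to the next slot, wrapping around
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("hash table is full")

table = [None] * 10
slots = [linear_probe_insert(table, k) for k in (49, 18, 89, 48)]
print(slots)  # [9, 8, 0, 1]
```

Under these assumptions 89 collides with 49 in slot 9 and wraps to slot 0, while 48 has to probe past slots 8, 9, and 0 before landing in slot 1.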

23
Open Addressing

24
Problems
  • In linear probing, records tend to cluster around
    each other. (Once an element is placed in the
    hash table, the chances of its adjacent element
    being filled are doubled: it can be filled either
    by a collision or directly.)
  • If two adjacent elements are filled, then the
    chance of the next element being filled is three
    times that for an element with no neighbor.

25
Animation from the Web
  • The animation gives you a practical demonstration
    of the effect of linear probing; it also
    implements a quadratic re-hash function so that
    you can see differences.
  • http://ciips.ee.uwa.edu.au/morris/Year2/PLDS210/hash_tables.html

26
Clustering
  • Linear probing is subject to a clustering
    phenomenon.
  • Re-hashes from one location occupy a block of
    slots in the table which grows towards slots
    and blocks to which other keys hash.
  • This exacerbates the collision problem and the
    number of re-hashes can become large.

27
Quadratic Probing
  • Better behavior is usually obtained with
    quadratic probing, where the secondary hash
    function depends on the re-hash index:
  • address = h(key) + c i^2
  • on the i-th re-hash. (A more complex function of i
    can be used.)
  • Quadratic probing is susceptible to secondary
    clustering, since keys which have the same hash
    value also have the same probe sequence.
  • Secondary clustering is not nearly as severe as
    clustering caused by linear probing.
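A sketch of quadratic probing using the formula address = h(key) + c i^2, reduced modulo the table size. The choices c = 1, h(key) = key mod 10, and the reuse of the keys 49, 18, 89, 48 are assumptions made for illustration.

```python
def quadratic_probe_insert(table, key, c=1):
    """Insert key using quadratic probing; return the slot that was used."""
    size = len(table)
    home = key % size
    for i in range(size):
        slot = (home + c * i * i) % size  # i-th re-hash: h(key) + c*i^2
        if table[slot] is None:
            table[slot] = key
            return slot
    # Quadratic probing does not necessarily visit every slot.
    raise RuntimeError("no free slot found in the probe sequence")

table = [None] * 10
slots = [quadratic_probe_insert(table, k) for k in (49, 18, 89, 48)]
print(slots)  # [9, 8, 0, 2]
```

Compared with a +1 step, key 48 skips from slot 9 to slot 2 rather than extending the block of filled slots at 8, 9, 0.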

28
Overflow area
  • When a collision occurs, a slot in an overflow
    area is used for the new element and a link from
    the primary slot established as in a chained
    system.
  • This is essentially the same as chaining, except
    that the overflow area is pre-allocated and thus
    may be faster to access.
  • As with re-hashing, the maximum number of
    elements must be known in advance, but in this
    case, two parameters must be estimated: the
    optimum sizes of the primary and overflow areas.
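One possible way to picture an overflow area is sketched below. The layout is an assumption, not taken from the slides: each entry is a [key, next_index] pair, links are stored as indices into a pre-allocated overflow list, and the class name is made up.

```python
class OverflowHashTable:
    """Primary table plus a pre-allocated overflow area, linked by index."""

    def __init__(self, primary_size, overflow_size):
        self.primary_size = primary_size
        self.primary = [None] * primary_size    # entries: [key, next_index]
        self.overflow = [None] * overflow_size  # fixed size, allocated up front
        self.next_free = 0                      # next unused overflow slot

    def insert(self, key):
        i = key % self.primary_size
        if self.primary[i] is None:
            self.primary[i] = [key, None]
            return
        if self.next_free >= len(self.overflow):
            raise RuntimeError("overflow area is full")
        # Collision: store the key in the overflow area and link it
        # in directly behind the primary slot.
        self.overflow[self.next_free] = [key, self.primary[i][1]]
        self.primary[i][1] = self.next_free
        self.next_free += 1

    def find(self, key):
        entry = self.primary[key % self.primary_size]
        while entry is not None:
            if entry[0] == key:
                return True
            entry = None if entry[1] is None else self.overflow[entry[1]]
        return False

table = OverflowHashTable(primary_size=10, overflow_size=4)
for key in (49, 89, 19):
    table.insert(key)
print(table.find(89))  # True
print(table.find(29))  # False
```

The two constructor arguments correspond to the two sizes that have to be estimated in advance.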

29
Overflow Area

30
Comparison

31
Hash Functions
  • If the hash function is uniform (equally
    distributes the data keys among the hash table
    indices), then hashing effectively subdivides the
    list to be searched.
  • Worst-case behavior occurs when all keys hash to
    the same index. Why?
  • It is important to choose a good hash function.
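The difference between a uniform hash function and the worst case can be made concrete by measuring the longest list that a search might have to scan. The keys and the two hash functions below are illustrative assumptions.

```python
from collections import Counter

def longest_chain(keys, hash_fn):
    """Length of the longest list when keys are grouped by hash index."""
    counts = Counter(hash_fn(k) for k in keys)
    return max(counts.values())

keys = range(1000, 1100)  # 100 hypothetical keys

print(longest_chain(keys, lambda k: k % 10))  # 10  (uniform: 10 keys per index)
print(longest_chain(keys, lambda k: 0))       # 100 (every key at index 0)
```

In the second case every lookup has to search a single list of all 100 keys, so the table gives no advantage over an unsorted list.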

32
Choosing Hash Functions
  • Choice of h: h(x)
  • must be simple
  • must distribute (spread) the data evenly
  • Choice of m: m approximates n (about 1
    item per linked list), where n = input size

33
Mod Function
  • Choice of a three-digit hash for phone numbers,
    e.g. 398-3738
  • x is an integer value; h(x) = x mod m.
  • Choosing the last three digits (738) is more
    appropriate than the first three digits (398) as
    it distributes the data more evenly.
  • To do this, take the mod function:
  • x mod m
  • h(x) = x mod 10^k gives the last k digits
  • h(x) = x mod 2^k gives the last k bits
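Both forms of the mod function can be tried directly. The function names are illustrative, and the phone number is the one from the slide written without the dash.

```python
def last_k_digits(x, k):
    """h(x) = x mod 10^k keeps the last k decimal digits of x."""
    return x % 10**k

def last_k_bits(x, k):
    """h(x) = x mod 2^k keeps the last k binary digits of x."""
    return x % 2**k

print(last_k_digits(3983738, 3))  # 738
print(last_k_bits(0b101101, 3))   # 5, i.e. binary 101
```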

34
Middle Digits of an Integer
  • This often yields unpredictable (and thus good)
    distributions of the data.
  • Assume that you wish to take the two digits three
    positions from the right of x.
  • If x = 539872178 then h(x) = 72.
  • This is obtained by h(x) = (x/1000) mod 100,
    where (x/1000) drops three digits (using integer
    division) and (x/1000) mod 100 keeps two digits.
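The middle-digits calculation can be written as a small function. The parameter names skip and keep are illustrative; Python's // operator performs the integer division.

```python
def middle_digits(x, skip=3, keep=2):
    """Drop the last `skip` digits of x, then keep the next `keep` digits."""
    return (x // 10**skip) % 10**keep

print(middle_digits(539872178))  # 72
```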

35
Order Preserving Hash Function
  • x < y implies h(x) < h(y)
  • Application: sorting

36
Perfect Hashing Function
  • A perfect hashing function is one that causes no
    collisions.
  • Perfect hashing functions can be found only under
    certain conditions.
  • One application of the perfect hash function is a
    static dictionary.
  • h(x) is designed after having peeked at the data.
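As a toy illustration of designing h(x) after looking at the data, the sketch below searches for a table size m such that h(x) = x mod m causes no collisions for one fixed set of keys. This brute-force search is only one simple possibility chosen for illustration, and the keys are made up.

```python
def find_perfect_modulus(keys):
    """Return the smallest m >= len(keys) with no collisions under x mod m."""
    keys = list(keys)
    if len(set(keys)) != len(keys):
        raise ValueError("keys must be distinct")
    m = max(1, len(keys))  # fewer slots than keys always forces a collision
    while len({k % m for k in keys}) < len(keys):
        m += 1
    return m

static_keys = [10, 25, 31, 47]        # hypothetical static dictionary keys
m = find_perfect_modulus(static_keys)
print(m)                              # 9
print([k % m for k in static_keys])   # [1, 7, 4, 2]
```

The result is only collision-free for this particular key set, which is why the approach suits a static dictionary.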

37
Retrieval
  • Retrieving a record works the same way as
    insertion.
  • Take the key value, perform the same
    transformation as for insertion, then look up the
    value in the hash table.

38
Issues
  • There are two basic issues when designing a hash
    algorithm:
  • Choosing the best hash function
  • Deciding what to do with collisions

39
Hash Function Strategies
  • If the key is an integer and there is no reason
    to expect a non-random key distribution, then the
    modulus operator is a simple, efficient, and
    effective method.
  • If the key is a string value (e.g. someone's name
    or C reserved words), then it first needs to be
    transformed to an integer.
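One common way to transform a string into an integer is to treat its characters as digits of a number in some base and then apply the modulus operator. The slides do not specify a method, so the base of 31, the table size of 479, and the function names below are assumptions for illustration.

```python
def string_to_int(s, base=31):
    """Combine the character codes of s into a single integer."""
    value = 0
    for ch in s:
        value = value * base + ord(ch)
    return value

def hash_string(s, table_size=479):
    """Hash a string key by converting it to an integer, then taking mod."""
    return string_to_int(s) % table_size

print(string_to_int("ab"))   # 3105, i.e. 97 * 31 + 98
print(hash_string("while"))  # some index in the range 0..478
```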