1
Hash Tables
2
Objectives
  • Exploit random access to get O(1) time to lookup
    items in a collection
  • Use a hash function that converts an items value
    into its offset in a sequence
  • Be tolerant of data types for which the mapping
    function has collisions
  • The hash function will be imperfect, and two or
    more items may get hashed to the same position.
  • We want to respond to this be increasing the time
    complexity (up to O(n)), but we still need it to
    work.
  • Get Lucky infinitely often
  • Wed like to get O(1) access on average.

3
The Idea
  • Have an array of size m in which we will store n
    elements (let f = n/m be the load factor of our
    table).
  • The array may contain elements directly, or it
    may contain sets of elements (more on this
    later).
  • To find something in the table, we first select
    one of the m positions using a hash function
    and then look for the item in that position.
  • If we're lucky, the item will be found in O(1)
    time.

4
What's a Hash Function?
  • A hash function is any function that computes
    (and returns) an integer based on one of the
    items we wish to insert into our table
  • The function must be fixed and deterministic
  • Ideally the function returns a distinct integer
    for each item in our table.
  • NOTE: I describe the hashing process as two
    steps (a sketch follows below)
  • The hash function computes an arbitrary integer
    (from -2^B to 2^B)
  • The integer is then mapped onto a position within
    the table (a value between 0 and m-1, where m is
    the size of our table).
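
  (A minimal C++ sketch of the two steps just described, assuming string
  keys and using std::hash as the first-step function; the name slot_for
  is illustrative, not from the slides.)

    #include <cstddef>
    #include <functional>
    #include <string>

    // Step 1: compute an arbitrary (large) integer from the item.
    // Step 2: map that integer onto a position in [0, m-1].
    std::size_t slot_for(const std::string& item, std::size_t m) {
        std::size_t raw = std::hash<std::string>{}(item);  // fixed, deterministic
        return raw % m;                                     // position within the table
    }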

5
What's a Collision?
  • When two different items get mapped to the same
    position in the array, it's called a collision.
  • There are two basic techniques for dealing with
    collisions
  • Store a set of elements at each position in the
    array, and when there's a collision, just put
    both items into the collection at that position.
    This technique is called chaining
  • Use a pair of hash functions, one that is always
    used to find the first position; if there's a
    collision, we'll use a second hash function to
    pick another slot. This technique is called
    rehashing.

6
Chaining with sequences
  • If collisions are resolved with chaining, then
    the hash table will be encoded as an array of
    vectorltitemgt (or possibly an array of listltitemgt,
    but thats almost certainly an inferior
    implementation).
  • We find an item by
  • Find the position within the array by hashing
  • Linear search through the elements in the chain
    (the vectorltitemgt) until we either find the item
    were looking for or reach the end of the chain.
  • The time complexity for looking up an item is
  • O(1) best case we can hope for, the length of
    all chains are bounded by a constant
  • O(n) worst case we can get, the length of the
    longest chain is proportional to n (e.g., all
    items map to the same position).
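
  (A hedged C++ sketch of lookup with chaining, matching the
  array-of-vector<item> layout above; the helper name contains is
  illustrative.)

    #include <cstddef>
    #include <functional>
    #include <string>
    #include <vector>

    using item = std::string;

    // table[i] is the chain (possibly empty) of all items that hashed to slot i.
    bool contains(const std::vector<std::vector<item>>& table, const item& x) {
        std::size_t slot = std::hash<item>{}(x) % table.size();  // find the position by hashing
        for (const item& candidate : table[slot])                // linear search through the chain
            if (candidate == x) return true;
        return false;                                            // reached the end of the chain
    }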

7
Simple Uniform Hashing
  • If we had a perfect hash function, then any key
    that we pick is equally likely to end up mapped
    to any one of the m entries in the table. This
    assumption is known as the simple uniform
    hashing assumption.

8
What's the average cost?
  • Let's assume a table with n items and load factor f,
    in which we'll do k lookups. We'll assume that
    both k and n grow to infinity, that k >> n, and
    that f remains constant (more on this later).
  • We'll also assume simple uniform hashing
  • For each of the k lookups, we are equally likely
    to select any position within the table.
  • The cost of each lookup is Θ(c), where c is the
    length of the chain.
  • If k is large, then each chain is equally likely
    to be visited. Thus, the average value of c will
    be f
  • Hence, on average, the cost for lookup under the
    assumption of simple uniform hashing is Θ(f).

9
Managing the load factor
  • Most implementations target a load factor f < 1.
  • Note that the implementation can trivially keep
    track of m, n, and therefore f.
  • Each time we insert a new item, increment n, etc.
  • When f exceeds the target threshold (e.g., 0.67),
    we do the following (sketched below)
  • Allocate a new table with larger m
  • Compute hash values for all the items in the old
    table and insert them into the new table.
  • Items that collided before may not collide in the
    new table, and vice versa
  • Throw away the old table.
  • We can also set a minimum value for f
  • If enough items are removed from the table, we
    can allocate a smaller table and move all the
    items into the new table.
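
  (A C++ sketch of the resize step described above, assuming the chained
  representation from the earlier slides; the helper name grow and the
  doubling policy are illustrative choices, not mandated by the slides.)

    #include <cstddef>
    #include <functional>
    #include <string>
    #include <vector>

    using item    = std::string;
    using table_t = std::vector<std::vector<item>>;

    // Called when f = n/m exceeds the threshold: build a bigger table, re-insert everything.
    table_t grow(const table_t& old_table) {
        table_t new_table(old_table.size() * 2);               // a new table with larger m
        for (const auto& chain : old_table)
            for (const item& x : chain) {
                std::size_t slot = std::hash<item>{}(x) % new_table.size();
                new_table[slot].push_back(x);                  // re-hash into the new table
            }
        return new_table;                                      // the old table is thrown away
    }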

10
Can Hash Tables have Iterators?
  • Yes, any collection can have an iterator.
  • However, a HashTable is not an ordered
    collection.
  • So, we can visit the values in any order (as long
    as we visit each one exactly once).
  • The order in which the values are visited may
    change.
  • For example, as more items are added, we may have
    to resize the table, increasing m
  • When the same set of items is hashed into the
    new array, the items may end up in different
    positions, and in a different order.
  • For example, 11 comes before 2 in an array
    with 10 elements (using modulo m indexing), but
    the order is changed if the table is increased to
    size 20.

11
Should m be prime?
  • It's not unusual for the hashed values to have
    some regular sequence.
  • Some sequences will make rather poor use of the m
    locations in the array.
  • If the hash values are k, 2k, 3k, ..., then we will
    use only 1/GCD(k, m) of the m slots (illustrated
    below).
  • For example, if m is 10, and all our hash values
    are even, then GCD(m, 2) = 2, so we'll use only ½
    of the slots in the array.
  • If m is prime, then unless the hash values happen
    to all be a multiple of m, the GCD will be 1, and
    we'll more evenly distribute the hashed values.
  • Unfortunately, x mod y is a very expensive
    operation on most computers (roughly ten times as
    slow as x * y).
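
  (A small C++ illustration of the claim above: with hash values
  k, 2k, 3k, ... only m / GCD(k, m) distinct slots are ever hit; here
  k = 2 and m = 10 touches 5 of the 10 slots.)

    #include <iostream>
    #include <set>

    int main() {
        const int m = 10, k = 2;                 // table of size 10, hash values k, 2k, 3k, ...
        std::set<int> slots_used;
        for (int i = 1; i <= 1000; ++i)
            slots_used.insert((i * k) % m);      // slot that hash value i*k lands in
        std::cout << slots_used.size() << " of " << m
                  << " slots are ever used\n";   // prints "5 of 10 slots are ever used"
    }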

12
Using Multiplication
  • Our basic problem is that we're given a number k
    in [0, 2^B) and we need to map it into the range
    [0, m).
  • We can use the following
  • Multiply k by A, where A is an irrational number
    between 0 and 1.
  • Take the fractional part of kA; think of this as
    a percentage.
  • Index into the array by that percentage (e.g.,
    go 82% of the way into m) by multiplying m by the
    percentage (and rounding down to the nearest
    integer).
  • Using (√5 - 1) / 2 for A seems to work pretty
    well.
  • Note that A is a constant, and can be
    precomputed.
  • If m is a power of 2, then we can make this
    really easy (see the sketch below).
  • Precompute A left-shifted by 32 bits.
  • Multiply k and A, ignoring the overflow (i.e.,
    keep the least significant 32 bits)
  • Right-shift the result by 32 - log2(m) bits. This
    is h(k).
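
  (A C++ sketch of the power-of-two shortcut above; the constant
  2654435769 is the assumed 32-bit fixed-point value of A = (√5 - 1)/2,
  and log2m is assumed to be precomputed.)

    #include <cstdint>

    // A in 32-bit fixed point: floor(((sqrt(5) - 1) / 2) * 2^32) = 2654435769.
    const uint32_t A_FIXED = 2654435769u;

    // Multiplicative hashing for m = 2^log2m (with 1 <= log2m <= 31).
    uint32_t mult_hash(uint32_t k, unsigned log2m) {
        uint32_t frac = k * A_FIXED;      // low 32 bits = fractional part of k*A (overflow discarded)
        return frac >> (32 - log2m);      // the top log2(m) bits give a slot in [0, m)
    }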

13
Open Addressing / Rehashing
  • We use an array of keys (instead of an array of
    chains of keys).
  • Obviously n < m
  • We need some way to indicate that an array
    position is empty (e.g., a nil value).
  • When there is a collision, we hash into the table
    again to find another position (sketched below).
  • The hash function is generalized to be a function
    of two values h(k, j), where k is the key, and j
    indicates how many times we've previously hashed
    this same key.
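
  (A C++ sketch of open-addressing search built on a generalized h(k, j),
  passed in as probe; the EMPTY sentinel and the function names are
  illustrative. Concrete probe functions appear on the next two slides.)

    #include <cstddef>
    #include <cstdint>
    #include <functional>
    #include <vector>

    const uint32_t EMPTY = UINT32_MAX;   // sentinel meaning "this slot is unused"

    // probe(k, j) is the generalized hash function h(k, j): the j-th slot to try for key k.
    bool contains(const std::vector<uint32_t>& table, uint32_t key,
                  const std::function<std::size_t(uint32_t, std::size_t)>& probe) {
        for (std::size_t j = 0; j < table.size(); ++j) {   // at most m probes
            std::size_t pos = probe(key, j);
            if (table[pos] == key)   return true;          // found it
            if (table[pos] == EMPTY) return false;         // an empty slot ends the search
        }
        return false;                                      // every slot probed, key not present
    }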

14
Linear Probing (not good)
  • h(k, j) = (h(k) + j) mod m (see below)
  • In other words, if you don't find the key in the
    first place you look, look next door
  • Keep searching the adjacent locations until you
    either find it, or until you find an empty slot.
  • This approach tends to degrade rapidly
  • Once you collide, you start to create a block
    of consecutive array positions that are filled.
  • This increases the probability that you'll have a
    collision (and increases the time you spend
    probing as a result of a collision).
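
  (The linear-probing version of h(k, j) as a C++ sketch; h0 stands for
  the first-step hash h(k).)

    #include <cstddef>

    // Linear probing: h(k, j) = (h(k) + j) mod m -- look next door on each collision.
    // h0 is h(k), the first-step hash of the key; j is how many probes we've already made.
    std::size_t linear_probe(std::size_t h0, std::size_t j, std::size_t m) {
        return (h0 + j) % m;
    }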

15
Double hashing (rehashing)
  • Use two hash functions
  • h(k, j) = (h1(k) + j·h2(k)) mod m (sketched below)
  • For example
  • h1(k) = h(k) mod m
  • h2(k) = h(k) mod m2, where m2 < m
  • If you let m be prime and pick m2 equal to m - 1,
    then this technique can work very well.
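
  (And the double-hashing version of h(k, j) as a C++ sketch, following
  the formulas above; m2 = m - 1 as suggested on the slide.)

    #include <cstddef>

    // Double hashing: h(k, j) = (h1(k) + j * h2(k)) mod m,
    // with h1(k) = h(k) mod m and h2(k) = h(k) mod m2 (m prime, m2 = m - 1).
    // Note: if h2 works out to 0 the probe never advances; a common variant uses 1 + h mod m2.
    std::size_t double_probe(std::size_t h, std::size_t j, std::size_t m, std::size_t m2) {
        std::size_t h1 = h % m;
        std::size_t h2 = h % m2;
        return (h1 + j * h2) % m;
    }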