151 - PowerPoint PPT Presentation

About This Presentation
Title:

151

Description:

like a priority queue, a dictionary stores key-element pairs ... Implementing a Dictionary with a Sequence ... There are a few ways to design this dictionary: ... – PowerPoint PPT presentation

Slides: 50
Provided by: christin90
Category:
Tags: dictionary


Transcript and Presenter's Notes

Title: 151


1
Dictionaries, Tables & Hashing
  • TCSS 342

2
The Dictionary ADT
  • a dictionary (table) is an abstract model of a
    database
  • like a priority queue, a dictionary stores
    key-element pairs
  • the main operation supported by a dictionary is
    searching by key

3
Examples
  • Telephone directory
  • Library catalogue
  • Books in print (key: ISBN)
  • FAT (File Allocation Table)

4
Main Issues
  • Size
  • Operations: search, insert, delete, ... Create
    reports? List?
  • What will be stored in the dictionary?
  • How will items be identified?

5
The Dictionary ADT
  • simple container methods
  • size()
  • isEmpty()
  • elements()
  • query methods
  • findElement(k)
  • findAllElements(k)

6
The Dictionary ADT
  • update methods
  • insertItem(k, e)
  • removeElement(k)
  • removeAllElements(k)
  • special element
  • NO_SUCH_KEY, returned by an unsuccessful search

7
Implementing a Dictionary with a Sequence
  • unordered sequence
  • searching and removing take O(n) time
  • inserting takes O(1) time
  • applications to log files (frequent insertions,
    rare searches and removals)

(Figure: unordered sequence 34, 14, 12, 22, 18)
8
Implementing a Dictionary with a Sequence
  • array-based ordered sequence (assumes keys can
    be ordered)
  • searching takes O(log n) time (binary search)
  • inserting and removing take O(n) time
  • application to look-up tables (frequent
    searches, rare insertions and removals)

(Figure: ordered sequence 12, 14, 18, 22, 34)
9
Binary Search
  • narrow down the search range in stages
  • high-low game
  • findElement(22)

(Figure: sorted array 2 4 5 7 8 9 12 14 17 19 22 25 27 28 33 37 with low, mid, and high markers; the first mid element is 14.)
10
Binary Search
(Figure: the same array at the next three stages of findElement(22); the mid element is 25, then 19, then 22, at which point low, mid, and high coincide.)
11
Pseudocode for Binary Search
  • Algorithm BinarySearch(S, k, low, high)
      if low > high then
        return NO_SUCH_KEY
      else
        mid ← (low + high) / 2
        if k = key(mid) then
          return key(mid)
        else if k < key(mid) then
          return BinarySearch(S, k, low, mid - 1)
        else
          return BinarySearch(S, k, mid + 1, high)
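
The pseudocode above can be sketched in Java roughly as follows. This is a minimal illustration, not the course's code: it assumes integer keys held in a sorted array, returns the rank (index) of the key rather than the key itself, and uses -1 as a stand-in for NO_SUCH_KEY. The class and method names are made up for the example.

```java
class BinarySearchSketch {
    // Hypothetical sentinel standing in for the slides' NO_SUCH_KEY.
    static final int NO_SUCH_KEY = -1;

    // Returns the index of k in the sorted array s, searching s[low..high].
    static int binarySearch(int[] s, int k, int low, int high) {
        if (low > high) {
            return NO_SUCH_KEY;               // empty range: unsuccessful search
        }
        int mid = low + (high - low) / 2;     // same as (low + high) / 2, without overflow
        if (k == s[mid]) {
            return mid;
        } else if (k < s[mid]) {
            return binarySearch(s, k, low, mid - 1);
        } else {
            return binarySearch(s, k, mid + 1, high);
        }
    }

    public static void main(String[] args) {
        int[] s = {2, 4, 5, 7, 8, 9, 12, 14, 17, 19, 22, 25, 27, 28, 33, 37};
        System.out.println(binarySearch(s, 22, 0, s.length - 1)); // prints 10
        System.out.println(binarySearch(s, 3, 0, s.length - 1));  // prints -1
    }
}
```

Searching for 22 visits the mid elements 14, 25, 19, and 22, matching the stages of the findElement(22) example.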

12
Running Time of Binary Search
  • The range of candidate items to be searched is
    halved after each comparison

13
Running Time of Binary Search
  • In the array-based implementation, access by rank
    takes O(1) time, thus binary search runs in O(log
    n) time
  • Binary Search is applicable only to Random Access
    structures (Arrays, Vectors)

14
Implementations
  • Sorted? Non Sorted?
  • Elementary: arrays, vectors, linked lists
  • Organization: none (log file), sorted, hashed
  • Advanced: balanced trees

15
Skip Lists
  • Simulate Binary Search on a linked list.
  • Linked list allows easy insertion and deletion.
  • http://www.epaperpress.com/s_man.html

16
Hashing
  • Place item with key k in position h(k).
  • Hope h(k) is 1-1.
  • Requires unique key (unless multiple items
    allowed). Key must be protected from change (use
    abstract class that provides only a constructor).
  • Keys must be comparable.

17
Key class
  • public abstract class KeyID {
      private Comparable searchKey;
      public KeyID(Comparable m) {   // only one constructor
        searchKey = m;
      }
      public Comparable getSearchKey() {
        return searchKey;
      }
    }

18
Hash Tables
  • RTT is a large phone company, and they want to
    provide enhanced caller ID capability
  • given a phone number, return the caller's name
  • phone numbers are in the range 0 to R = 10^10 - 1
  • n is the number of phone numbers used
  • want to do this as efficiently as possible

19
Alternatives
  • There are a few ways to design this dictionary
  • Balanced search tree (AVL, red-black, 2-4 trees,
    B-trees) or a skip-list with the phone number as
    the key has O(log n) query time and O(n) space
    --- good space usage and search time, but can we
    reduce the search time to constant?
  • A bucket array indexed by the phone number has
    optimal O(1) query time, but there is a huge
    amount of wasted space: O(n + R)

20
Bucket Array
  • Each cell is thought of as a bucket or a
    container
  • Holds key-element pairs
  • In array A of size N, an element e with key k is
    inserted in A[k].
  • Table operations without searches!

(Figure: bucket array indexed 000-000-0000, 000-000-0001, ..., 401-863-7639, ..., 999-999-9999; the cells shown hold (null) except 401-863-7639, which holds Roberto.)
Note: we need 10,000,000,000 buckets!
21
Generalized indexing
  • Hash table
  • Data storage location associated with a key
  • The key need not be an integer, but keys must be
    comparable.

22
Hash Tables
  • A data structure
  • The location of an item is determined
  • Directly as a function of the item itself
  • Not by a sequence of trial and error comparisons
  • Commonly used to provide faster searching.
  • Comparisons of searching time
  • O(n) for linear searches
  • O(log n) for binary search
  • O(1) for hash table

23
Examples
  • A symbol table constructed by a compiler.
  • Stores identifiers and information about them in
    an array.
  • File systems
  • I-node location of a file in a file system.
  • Personal records
  • Personal information retrieval based on key

24
Hashing Engine
(Figure: itemKey fed into a Position Calculator, which yields a table position.)
25
Example
  • Insert item (401-863-7639, Roberto) into a table
    of size 5
  • calculate 4018637639 mod 5 = 4, insert item
    (401-863-7639, Roberto) in position 4 of the
    table (array, vector).
  • A lookup uses the same process: use the hash
    engine to map the key to a position, then check
    the array cell at that position.

(Figure: table with positions 0 to 4; position 4 holds (401-863-7639, Roberto).)
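
As a rough sketch of the insert and lookup steps in this example, assuming the phone number is treated as a 10-digit integer and the table stores only the name (the class and method names are hypothetical):

```java
class PhoneHashSketch {
    // Strip the dashes and reduce the number modulo the table size.
    static int position(String phone, int tableSize) {
        long key = Long.parseLong(phone.replace("-", ""));
        return (int) (key % tableSize);
    }

    public static void main(String[] args) {
        String[] table = new String[5];

        // Insert: compute the position, then store the element there.
        table[position("401-863-7639", table.length)] = "Roberto";

        // Lookup: same computation, then read the cell.
        System.out.println(position("401-863-7639", table.length));        // prints 4
        System.out.println(table[position("401-863-7639", table.length)]); // prints Roberto
    }
}
```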
26
Chaining
  • The expected search/insertion/removal time is
    O(n/N), provided the indices are uniformly
    distributed
  • The performance of the data structure can be
    fine-tuned by changing the table size N

27
From Keys to Indices
  • The mapping of keys to indices of a hash table is
    called a hash function
  • A hash function is usually the composition of two
    maps
  • hash code map: key → integer
  • compression map: integer → [0, N - 1]
  • An essential requirement of the hash function is
    to map equal keys to equal indices.
  • A good hash function is fast and minimizes the
    probability of collisions
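
The two-stage structure can be sketched as below. This is only an illustration under assumptions: it uses String keys, borrows Java's built-in String.hashCode() as the hash code map, and uses Math.floorMod as the compression map so that negative hash codes still land in [0, N - 1]. The class and method names are invented for the example.

```java
class HashFunctionSketch {
    // Hash code map: key -> integer (here, Java's own String hash code).
    static int hashCodeMap(String key) {
        return key.hashCode();
    }

    // Compression map: integer -> [0, n - 1]; floorMod never returns a negative value.
    static int compressionMap(int code, int n) {
        return Math.floorMod(code, n);
    }

    // The hash function is the composition of the two maps.
    static int hash(String key, int n) {
        return compressionMap(hashCodeMap(key), n);
    }

    public static void main(String[] args) {
        int n = 13;
        // Equal keys (even distinct String objects) map to equal indices.
        System.out.println(hash("Roberto", n) == hash(new String("Roberto"), n)); // prints true
    }
}
```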

28
Perfect hash functions
  • A perfect hash function maps each key to a unique
    position.
  • A perfect hash function can be constructed if we
    know in advance all the keys to be stored in the
    table (almost never)

29
A good hash function
  • Be easy and fast to compute
  • Distribute items evenly throughout the hash table
  • Efficient collision resolution.

30
Popular Hash-Code Maps
  • Integer cast: for numeric types with 32 bits or
    less, we can reinterpret the bits of the number
    as an int
  • Component sum: for numeric types with more than
    32 bits (e.g., long and double), we can add the
    32-bit components.
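
A possible Java rendering of these two hash-code maps is sketched here. The method names are hypothetical; Float.floatToIntBits and Double.doubleToLongBits are standard library calls. Note that Java's own Long.hashCode combines the two halves with XOR rather than addition, so this follows the slide, not the library.

```java
class HashCodeMapsSketch {
    // Integer cast: reinterpret the 32 bits of a float as an int.
    static int integerCast(float x) {
        return Float.floatToIntBits(x);
    }

    // Component sum: add the high and low 32-bit halves of a long.
    static int componentSum(long x) {
        return (int) (x >>> 32) + (int) x;
    }

    // A double is first viewed as its 64-bit pattern, then summed the same way.
    static int componentSum(double x) {
        return componentSum(Double.doubleToLongBits(x));
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(integerCast(1.0f))); // prints 3f800000
        System.out.println(componentSum((1L << 32) + 5));           // prints 6
    }
}
```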

31
Sample of hash functions
  • Digit selection
  • h(2536924520) = 590
  • (select the 2nd, 5th and last digits).
  • This is usually not a good hash function. It will
    not distribute keys evenly.
  • A hash function should use every part of the key.

32
Sample (continued)
  • Folding: add all digits
  • Modulo arithmetic
  • h(key) = h(x) = x mod table_size.
  • Modulo arithmetic is a very popular basis for
    hash functions. To improve the chance of an even
    distribution, table_size should be a prime number.
    If n is the number of items, there is always a
    prime p with n < p < 2n.
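
The three sample functions (digit selection, folding, modulo) might look like this in Java. It is a sketch that assumes non-negative keys with at least five decimal digits; the class and method names are made up, and 101 is just an arbitrary prime table size.

```java
class SimpleHashesSketch {
    // Digit selection: concatenate the 2nd, 5th and last decimal digits.
    static int digitSelection(long key) {
        String d = Long.toString(key);
        return Integer.parseInt("" + d.charAt(1) + d.charAt(4) + d.charAt(d.length() - 1));
    }

    // Folding: add all decimal digits of the key.
    static int folding(long key) {
        int sum = 0;
        for (long rest = key; rest > 0; rest /= 10) {
            sum += (int) (rest % 10);
        }
        return sum;
    }

    // Modulo arithmetic: x mod table_size.
    static int modulo(long key, int tableSize) {
        return (int) (key % tableSize);
    }

    public static void main(String[] args) {
        System.out.println(digitSelection(2536924520L)); // prints 590
        System.out.println(folding(2536924520L));        // prints 38
        System.out.println(modulo(2536924520L, 101));    // some position in 0..100
    }
}
```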

33
Popular Hash-Code Maps
  • Polynomial accumulation: for strings of a natural
    language, combine the character values (ASCII or
    Unicode) a_0 a_1 ... a_(n-1) by viewing them as the
    coefficients of a polynomial
    a_0 + a_1 x + ... + a_(n-1) x^(n-1)
  • For instance, choosing x = 33, 37, 39, or 41
    gives at most 6 collisions on a vocabulary of
    50,000 English words.
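
One common way to evaluate such a polynomial is Horner's rule, sketched below. This is an assumption about implementation, not something the slides prescribe: the method name is hypothetical, and int overflow is simply allowed to wrap around.

```java
class PolynomialHashSketch {
    // Evaluates a_0 + a_1 x + ... + a_(n-1) x^(n-1), where a_i is the i-th character,
    // using Horner's rule starting from the highest coefficient.
    static int polynomialHash(String s, int x) {
        int h = 0;
        for (int i = s.length() - 1; i >= 0; i--) {
            h = h * x + s.charAt(i);   // int overflow wraps, which is fine for hashing
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(polynomialHash("ab", 33)); // 97 + 98 * 33, prints 3331
        System.out.println(polynomialHash("ba", 33)); // 98 + 97 * 33, prints 3299
    }
}
```

Unlike a plain sum of characters, the result depends on character order, so "ab" and "ba" get different codes.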

34
Popular Hash-Code Maps
  • Why is the component-sum hash code bad for
    strings?
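
One way to explore this question is to sum character values directly and compare words built from the same letters. The sketch below is only illustrative; the method name is hypothetical.

```java
class ComponentSumStringsSketch {
    // Component sum applied to a string: add up the character values.
    static int componentSum(String s) {
        int sum = 0;
        for (char c : s.toCharArray()) {
            sum += c;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Addition ignores order, so every rearrangement of the same letters collides.
        for (String w : new String[] {"stop", "tops", "pots", "spot"}) {
            System.out.println(w + " -> " + componentSum(w));
        }
    }
}
```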

35
Popular Compression Maps
  • Division: h(k) = k mod N
  • the choice N = 2^k is bad because not all the bits
    are taken into account
  • the table size N is usually chosen as a
    prime number
  • certain patterns in the hash codes are propagated
  • Multiply, Add, and Divide (MAD)
  • h(k) = (ak + b) mod N
  • eliminates patterns provided a mod N ≠ 0
  • same formula used in linear congruential
    (pseudo)random number generators
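
A small sketch of the two compression maps follows. The constants a = 3, b = 5 and the table sizes are arbitrary choices for illustration, and the names are hypothetical; Math.floorMod keeps results non-negative even for negative hash codes.

```java
class CompressionMapsSketch {
    // Division: h(k) = k mod N.
    static int division(int k, int n) {
        return Math.floorMod(k, n);
    }

    // MAD: h(k) = (a*k + b) mod N, computed in long to avoid int overflow.
    static int mad(int k, int a, int b, int n) {
        return (int) Math.floorMod((long) a * k + b, (long) n);
    }

    public static void main(String[] args) {
        // With N = 8 (a power of two), codes that differ only in high bits collide.
        System.out.println(division(16, 8) + " " + division(24, 8));     // prints 0 0
        // With prime N = 11 and a mod N != 0, the same codes are spread apart.
        System.out.println(mad(16, 3, 5, 11) + " " + mad(24, 3, 5, 11)); // prints 9 0
    }
}
```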

36
Java Hash
  • Java provides a hashCode() method for the Object
    class, which typically returns the 32-bit memory
    address of the object.
  • This default hash code would work poorly for
    Integer and String objects
  • The hashCode() method should be suitably
    redefined by classes.
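
As a hedged example of redefining hashCode(), consider a hypothetical PhoneNumber key class (not part of the slides). It overrides equals and hashCode together so that equal keys produce equal hash codes; Long.hashCode(long) is a standard library method.

```java
final class PhoneNumber {
    private final long digits;   // immutable, so the key cannot change after insertion

    PhoneNumber(long digits) {
        this.digits = digits;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof PhoneNumber && ((PhoneNumber) o).digits == digits;
    }

    @Override
    public int hashCode() {
        // Based on the value, not on object identity, so equal numbers hash alike.
        return Long.hashCode(digits);
    }

    public static void main(String[] args) {
        PhoneNumber a = new PhoneNumber(4018637639L);
        PhoneNumber b = new PhoneNumber(4018637639L);
        System.out.println(a.equals(b));                  // prints true
        System.out.println(a.hashCode() == b.hashCode()); // prints true
    }
}
```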

37
Collision
  • A collision occurs when two distinct items are
    mapped to the same position.
  • Insert (401-863-9350, Andy) → 0
  • And insert (401-863-2234, Devin): 4018632234 → 4.
    We have a collision!

(Figure: table with positions 0 to 4; (401-863-9350, Andy) is in position 0 and (401-863-7639, Roberto) in position 4.)
38
Collision Resolution
  • How to deal with two keys which map to the same
    cell of the array?
  • Need policies, design good Hashing engines that
    will minimize collisions.

39
Chaining I
  • Use chaining
  • Each position is viewed as a container of a list
    of items, not a single item. All items in this
    list share the same hash value.

40
Chaining II
(Figure: table with positions 0 to 4, each holding a list of items.)
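
A minimal sketch of chaining, assuming phone numbers as long keys, names as elements, and key mod N as the hash function. The class and its internals are hypothetical; the method names insertItem and findElement follow the Dictionary ADT slides, and null stands in for NO_SUCH_KEY.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

class ChainedTableSketch {
    private static final class Item {
        final long key;
        final String element;
        Item(long key, String element) { this.key = key; this.element = element; }
    }

    // Each table position holds a list of the items that hash there.
    private final List<List<Item>> buckets = new ArrayList<>();

    ChainedTableSketch(int n) {
        for (int i = 0; i < n; i++) {
            buckets.add(new LinkedList<Item>());
        }
    }

    private List<Item> bucketFor(long key) {
        return buckets.get((int) Math.floorMod(key, (long) buckets.size()));
    }

    void insertItem(long key, String element) {
        bucketFor(key).add(new Item(key, element));
    }

    String findElement(long key) {
        for (Item item : bucketFor(key)) {   // only the one chain is scanned
            if (item.key == key) return item.element;
        }
        return null;                         // stands in for NO_SUCH_KEY
    }

    public static void main(String[] args) {
        ChainedTableSketch table = new ChainedTableSketch(5);
        table.insertItem(4018637639L, "Roberto"); // position 4
        table.insertItem(4018632234L, "Devin");   // also position 4: same chain
        System.out.println(table.findElement(4018632234L)); // prints Devin
    }
}
```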
41
Collisions resolution policies
  • A key is mapped to an already occupied table
    location
  • what to do?!?
  • Use a collision handling technique
  • Chaining (may have fewer buckets than items)
  • Open Addressing (load factor < 1)
  • Linear Probing
  • Quadratic Probing
  • Double Hashing

42
Linear Probing
  • If the current location is used, try the next
    table location
  • linear_probing_insert(K)
      if (table is full) error
      probe = h(K)
      while (table[probe] occupied)
        probe = (probe + 1) mod M
      table[probe] = K

43
Linear Probing
  • Lookups walk along table until the key or an
    empty slot is found
  • Uses less memory than chaining
  • don't have to store all those links
  • Slower than chaining
  • may have to walk along table for a long way
  • Deletion is more complex
  • either mark the deleted slot
  • or fill in the slot by shifting some elements down

44
Linear Probing Example
  • h(k) = k mod 13
  • Insert keys:
  • 18 41 22 44 59 32 31 73

(Figure: table with positions 0 to 12, empty and then after the insertions; in position order the occupied cells hold 41, 18, 44, 59, 32, 22, 31, 73.)
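
The linear probing example can be replayed with a short sketch like the one below. It assumes non-negative integer keys, h(k) = k mod M, and a null cell meaning "empty"; the class and method names are hypothetical.

```java
class LinearProbingSketch {
    // Inserts the keys in order into a table of size m and returns the table.
    static Integer[] insertAll(int m, int... keys) {
        Integer[] table = new Integer[m];
        int used = 0;
        for (int k : keys) {
            if (used == m) throw new IllegalStateException("table is full");
            int probe = k % m;                 // h(k) = k mod M
            while (table[probe] != null) {     // cell occupied: try the next one
                probe = (probe + 1) % m;
            }
            table[probe] = k;
            used++;
        }
        return table;
    }

    public static void main(String[] args) {
        Integer[] t = insertAll(13, 18, 41, 22, 44, 59, 32, 31, 73);
        for (int i = 0; i < t.length; i++) {
            if (t[i] != null) System.out.println(i + ": " + t[i]);
        }
        // prints 2: 41, 5: 18, 6: 44, 7: 59, 8: 32, 9: 22, 10: 31, 11: 73 (one per line)
    }
}
```

Keys 44, 32, 31, and 73 all collide with earlier entries and slide forward, which produces the contiguous run from position 5 to 11.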
45
Double Hashing
  • Use two hash functions
  • If M is prime, eventually will examine every
    position in the table
  • double_hash_insert(K)
      if (table is full) error
      probe = h1(K)
      offset = h2(K)
      while (table[probe] occupied)
        probe = (probe + offset) mod M
      table[probe] = K

46
Double Hashing
  • Many of same (dis)advantages as linear probing
  • Distributes keys more uniformly than linear
    probing does

47
Double Hashing Example
  • h1(K) = K mod 13
  • h2(K) = 8 - (K mod 8)
  • we want h2 to be an offset to add
  • 18 41 22 44 59 32 31 73
  • h1(44) = 5 (occupied); h2(44) = 8 - 4 = 4;
    (5 + 4) mod 13 = 9 (occupied); (9 + 4) mod 13 = 0,
    so 44 goes in position 0

(Figure: table with positions 0 to 12, empty and then after the insertions; in position order the occupied cells hold 44, 41, 73, 18, 32, 59, 31, 22.)
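
The double hashing example can be checked with a sketch along the same lines. It assumes non-negative integer keys, the two functions from the slide (h1(K) = K mod 13 and h2(K) = 8 - (K mod 8)), and a null cell meaning "empty"; the names are hypothetical.

```java
class DoubleHashingSketch {
    static final int M = 13;   // prime table size, so every position is eventually probed

    static int h1(int k) { return k % M; }
    static int h2(int k) { return 8 - (k % 8); }   // offset in 1..8, never 0

    static Integer[] insertAll(int... keys) {
        Integer[] table = new Integer[M];
        int used = 0;
        for (int k : keys) {
            if (used == M) throw new IllegalStateException("table is full");
            int probe = h1(k);
            int offset = h2(k);
            while (table[probe] != null) {         // occupied: jump ahead by the offset
                probe = (probe + offset) % M;
            }
            table[probe] = k;
            used++;
        }
        return table;
    }

    public static void main(String[] args) {
        Integer[] t = insertAll(18, 41, 22, 44, 59, 32, 31, 73);
        for (int i = 0; i < t.length; i++) {
            if (t[i] != null) System.out.println(i + ": " + t[i]);
        }
        // prints 0: 44, 2: 41, 3: 73, 5: 18, 6: 32, 7: 59, 8: 31, 9: 22 (one per line)
    }
}
```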
48
Why so many Hash functions?
  • It's different strokes for different folks.
  • We seldom know the nature of the object that will
    be stored in our dictionary.

49
A FAT Example
  • Directory: key = file name; data = (time, date,
    size), location of first block in the FAT.
  • If first block is in physical location 23 (Disk
    block number) look up position 23 in the FAT.
    Either shows end of file or has the block number
    on disk.
  • Example: the directory entry points to block 4
  • FAT (positions 0 to 10): x x x F 5 6 10 x 23 25 3
  • The file occupies blocks 4, 5, 6, 10, 3.
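
Following the chain in this example can be sketched as below. The encoding is an assumption made for illustration: -1 plays the role of the end-of-file marker F, -2 plays the role of the x entries (not relevant to this file), and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class FatChainSketch {
    static final int END_OF_FILE = -1;  // the "F" marker in the example
    static final int UNUSED = -2;       // the "x" entries in the example

    // Starting from the first block, keep looking up the next block in the FAT.
    static List<Integer> blocksOf(int firstBlock, int[] fat) {
        List<Integer> blocks = new ArrayList<>();
        int block = firstBlock;
        while (block != END_OF_FILE) {
            blocks.add(block);
            block = fat[block];         // FAT entry gives the next block of the file
        }
        return blocks;
    }

    public static void main(String[] args) {
        int x = UNUSED, f = END_OF_FILE;
        int[] fat = {x, x, x, f, 5, 6, 10, x, 23, 25, 3};
        System.out.println(blocksOf(4, fat)); // prints [4, 5, 6, 10, 3]
    }
}
```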