
Title: Data Structures and Algorithms for Information Processing


1
Data Structures and Algorithms for Information
Processing
  • Lecture 9: Searching

2
Outline
  • The simplest method: serial search
  • Binary search
  • Open-address hashing
  • Chained hashing

3
Search Algorithms
  • Whenever large amounts of data need to be
    accessed quickly, search algorithms are crucially
    involved.

4
Search Algorithms
  • Lie at the heart of many computer
    technologies. To name a few:
      • Databases
      • Information retrieval applications
      • Web infrastructure (file systems, domain name
        servers, etc.)

5
Search Algorithms: Two Broad Categories
  • Searching a static database
      • Accessing indexed Web pages
      • Finding a file on disk
  • Evaluating a dynamically changing set of
    hypotheses
      • Computer chess
      • Speech recognition
  • We'll be concerned with the first

6
The Simplest Search: Serial Lookup
  • Items stored in an array or list
  • To search for an item x:
      • Start at the beginning of the list
      • Compare the current item to x
      • If unequal, proceed to next item

7
Pseudocode for Serial Search
  • // Find x in an array a of length n
  • int i = 0;
  • boolean found = false;
  • while ((i < n) && !found) {
  •     if (a[i] == x)
  •         found = true;
  •     else i++;
  • }
  • if (found) ...
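
The pseudocode above can be turned into a small runnable method. This is a minimal sketch, not the lecture's own code: the class name SerialSearch and the method name indexOf are illustrative, and returning -1 for "not found" is an assumed convention.

```java
// A minimal sketch of serial search; names are illustrative, not from the slides.
class SerialSearch {
    /** Returns the index of the first occurrence of x in a, or -1 if x is absent. */
    static int indexOf(int[] a, int x) {
        for (int i = 0; i < a.length; i++) {
            if (a[i] == x) {
                return i;   // found after i + 1 array accesses
            }
        }
        return -1;          // all n items were examined without a match
    }

    public static void main(String[] args) {
        int[] a = {7, 3, 9, 4};
        System.out.println(indexOf(a, 9));  // prints 2
        System.out.println(indexOf(a, 5));  // prints -1
    }
}
```

Returning from inside the loop replaces the found flag of the pseudocode; the number of array accesses is the same.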

8
Analysis for Serial Search
  • Best case: requires one array access
  • Worst case: requires n array accesses
  • Average case: to access an item, assuming its
    position is random (uniform)
      • (1 + 2 + 3 + ... + n)/n = n(n+1)/(2n) = (n+1)/2
  • O(n)

9
A Useful Combinatorial Identity
1 + 2 + 3 + ... + n = n(n+1)/2. Why? (Derived on p. 22 of
Main.)
10
Visual Counting
n × n
11
Visual Counting
n
12
Visual Counting
n × n - n
13
Visual Counting
(n × n - n)/2 + n = n(n+1)/2
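
The counting argument on the four slides above can be written out as one short calculation. The pictures did not survive in this transcript, so the reading used here is an assumption: an n-by-n square of cells with n cells on its diagonal.

```latex
\begin{align*}
\text{cells in the } n \times n \text{ square} &= n^2 \\
\text{cells off the diagonal} &= n^2 - n \\
\text{cells strictly below the diagonal} &= \frac{n^2 - n}{2} \\
1 + 2 + \cdots + n
  &= \frac{n^2 - n}{2} + n
   = \frac{n^2 + n}{2}
   = \frac{n(n+1)}{2}
\end{align*}
```

The last line counts the triangle made of the diagonal and everything below it, whose rows have 1, 2, ..., n cells.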
14
Binary Search
  • Can be used whenever the data are totally ordered
    -- e.g., the integers
  • Requires sorting in advance, and storing in an
    array
  • One of the simplest to implement, often fast
    enough
  • Can be tricky to handle boundary cases

15
Idea of Binary Search
  • Closely related to the natural algorithm we use
    to look up a word in a dictionary
  • Open to the middle
  • If target comes before all words on the page,
    search in left half of book
  • Otherwise, search in right half.

16
Interface for Binary Search
  • int search(int[] a, int first, int size, int
    target)
  • Parameters
      • int[] a: array to be searched over
      • Search over a[first], a[first+1], ..., a[first+size-1]
  • Invariants
      • array is sorted in increasing order
      • first >= 0

17
Implementation
  • int search(int[] a, int start, int size, int target) {
  •     if (size <= 0) return -1;
  •     else {
  •         int middle = start + size/2;
  •         if (a[middle] == target) return middle;
  •         else if (target < a[middle])
  •             return search(a, start, size/2, target);
  •         else
  •             return search(a, middle+1, size/2, target);
  •     }
  • }

18
Implementation
  • Where's the error?
  • Suppose size is even. Are the new sizes correct?
  • Suppose size is odd. Are the new sizes correct?
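
One way to explore these questions is to run the version from slide 17 on a small array. The sketch below is hedged: the operators lost in this transcript are reconstructed as `<=`, `+`, `==` and `<`, and the names BuggySearchDemo and buggySearch are made up for illustration.

```java
// Slide 17's search with both recursive calls passing size/2.
// Operators are reconstructed from the garbled transcript (an assumption).
class BuggySearchDemo {
    static int buggySearch(int[] a, int start, int size, int target) {
        if (size <= 0) return -1;
        int middle = start + size / 2;
        if (a[middle] == target) return middle;
        else if (target < a[middle])
            return buggySearch(a, start, size / 2, target);
        else
            return buggySearch(a, middle + 1, size / 2, target);
    }

    public static void main(String[] args) {
        int[] a = {1, 3, 5, 7};
        // Search only a[0..1] (size 2, even) for 5, which is not in that range.
        // The right-hand call is given size 1 although no elements remain,
        // so it reads a[2] and reports an index outside the requested range.
        System.out.println(buggySearch(a, 0, 2, 5));  // prints 2
        try {
            buggySearch(a, 0, 4, 8);  // size 4, target larger than every element
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("read past the end of the array");
        }
    }
}
```

With an even size, the elements to the right of the middle number one fewer than size/2, so the right-hand call is told about an element that is not there.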

19
Implementation
  • int search(int[] a, int first, int size, int target) {
  •     if (size <= 0) return -1;
  •     else {
  •         int middle = first + size/2;
  •         if (a[middle] == target) return middle;
  •         else if (target < a[middle])
  •             return search(a, first, size/2, target);
  •         else
  •             return search(a, middle+1, (size-1)/2, target);
  •     }
  • }
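
A runnable form of this version, with an exhaustive check over small arrays, might look like the sketch below. The class name BinarySearch and the test arrays of odd numbers are illustrative choices; the operators are reconstructed from the transcript.

```java
// A sketch of the corrected recursive binary search from this slide.
class BinarySearch {
    /** Searches a[first .. first+size-1], sorted ascending; returns an index or -1. */
    static int search(int[] a, int first, int size, int target) {
        if (size <= 0) return -1;
        int middle = first + size / 2;
        if (a[middle] == target) return middle;
        else if (target < a[middle])
            return search(a, first, size / 2, target);            // size/2 items lie left of middle
        else
            return search(a, middle + 1, (size - 1) / 2, target); // (size-1)/2 items lie right of middle
    }

    public static void main(String[] args) {
        // Check every array size from 0 to 20, every present key and every absent key.
        for (int n = 0; n <= 20; n++) {
            int[] a = new int[n];
            for (int i = 0; i < n; i++) a[i] = 2 * i + 1;   // 1, 3, 5, ...
            for (int i = 0; i < n; i++)
                if (search(a, 0, n, a[i]) != i) throw new AssertionError("missed a[" + i + "]");
            for (int t = 0; t <= 2 * n; t += 2)             // even numbers are never present
                if (search(a, 0, n, t) != -1) throw new AssertionError("false hit for " + t);
        }
        System.out.println("all boundary cases passed");
    }
}
```

Checking all sizes up to a small bound covers the even and odd cases as well as the empty and one-element arrays.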

20
Boundary Cases
  • Binary search is sometimes tricky to get right.
  • A common source of bugs.
  • Test cases are not always helpful for checking
    correctness of code.

21
Binary Search with Other Data Structures
  • Can binary search be implemented using linked
    lists rather than arrays?
  • Are there any other data structures that could be
    used?

22
Analysis of Binary Search
  • Recursively dividing the array in half represents
    the data as a full binary tree.
  • Simplest case -- array of size n = 2^k - 1,
    complete binary tree.
  • Take away one and divide by 2.
  • New size = 2^(k-1) - 1.
  • Thus, worst case involves O(log n) operations
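
The size argument can be checked numerically. The sketch below counts how many elements the recursive search examines in the worst case, assuming the halving rule "left part gets size/2, right part gets (size-1)/2"; the class and method names are illustrative.

```java
// Counts worst-case probes of recursive binary search on a range of a given size.
class BinarySearchProbes {
    static int worstCaseProbes(int size) {
        if (size <= 0) return 0;
        // One probe for the middle element, then continue in the larger part.
        // The left part (size/2) is never smaller than the right part ((size-1)/2).
        return 1 + worstCaseProbes(size / 2);
    }

    public static void main(String[] args) {
        for (int k = 1; k <= 10; k++) {
            int n = (1 << k) - 1;   // n = 2^k - 1
            System.out.println("n = " + n + ", worst-case probes = " + worstCaseProbes(n));
        }
    }
}
```

For n = 2^k - 1 the count is exactly k, which is log2(n + 1), in line with the O(log n) bound.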

23
Average Case
  • A complete binary tree with k leaves has k-1
    internal nodes.
  • So, about half of the n data elements sit at the
    leaves and require O(log n) operations to find.
  • Thus, assuming uniform distribution on target
    elements, average cost is also O(log n).

24
Binary Search is Limited
  • When we have a large number of items that will
    be accessed in part of the program where
    efficiency is crucial, binary search may be too
    slow

29
Hashing
  • Fortunately, we can often do better
  • Hashing is a technique where the access time
    can be O(1) rather than O(log n)

30
Open Address Hashing
  • The basic technique
  • Items are stored in an array of size N
  • The preferred position in the array is computed
    using a hash function of the item's key
  • When adding an item, if the preferred position is
    occupied, the next open position in the array is
    used instead.

31
Open Address Hashing
  • Main's presentation for Chapter 11

32
A Basic Hash Table
  • We keep arrays for the keys and data, and a bit
    indicating whether a given position has been
    occupied
  • public class Table {
  •     private int numItems;
  •     private Object[] keys;
  •     private Object[] data;
  •     private boolean[] hasBeenUsed;
  •     ....

33
The Hash Function
  • We can use the built in hash function that Java
    provides
  • private int hash(Object key) {
  •     return Math.abs(key.hashCode()) % data.length;
  • }
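
One boundary case of this hash function is worth a quick experiment: in Java, Math.abs(Integer.MIN_VALUE) is still negative, so a key whose hashCode() is Integer.MIN_VALUE produces a negative index. The sketch below compares the slide's formula with Math.floorMod, which is offered here as one possible alternative and is not part of the slides. The class and method names are illustrative.

```java
// Compares two ways of turning a hash code into an array index.
class HashIndex {
    // The formula from the slide (reconstructed with the % operator).
    static int hashAbs(Object key, int length) {
        return Math.abs(key.hashCode()) % length;
    }

    // Alternative (an assumption, not from the slides): always in 0 .. length-1.
    static int hashFloorMod(Object key, int length) {
        return Math.floorMod(key.hashCode(), length);
    }

    public static void main(String[] args) {
        Integer key = Integer.MIN_VALUE;   // an Integer's hashCode() is its own value
        System.out.println(hashAbs(key, 10));       // negative: not a valid index
        System.out.println(hashFloorMod(key, 10));  // between 0 and 9
    }
}
```

For keys with negative hash codes the two methods generally pick different positions, which is fine as long as one method is used consistently.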

34
Calculating the Index
  • // If found, the return value is the index of the key; otherwise -1
  • private int findIndex(Object key) {
  •     int count = 0;
  •     int i = hash(key);
  •     while ((count < data.length) && (hasBeenUsed[i])) {
  •         if (key.equals(keys[i])) return i;
  •         i = nextIndex(i);
  •         count++;
  •     }
  •     return -1;
  • }

35
Inserting an Item
  • public Object put(Object key, Object element) {
  •     int index = findIndex(key);
  •     if (index != -1) {
  •         Object answer = data[index];
  •         data[index] = element;
  •         return answer;
  •     }
  •     else if (numItems < data.length)
  •     ....

36
Inserting an Item
  • public Object put(Object key, Object element) {
  •     ...
  •     else if (numItems < data.length) {
  •         index = hash(key);
  •         while (keys[index] != null)
  •             index = nextIndex(index);
  •         keys[index] = key;
  •         data[index] = element;
  •         hasBeenUsed[index] = true;
  •         numItems++;
  •         return null;
  •     }
  •     else throw new IllegalStateException("Table full");
  •     ....
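
The fragments on the preceding slides can be assembled into one small class. This is a sketch under stated assumptions, not Main's full implementation: nextIndex is assumed to be linear probing with wrap-around (it is not shown in the slides), the get method and constructor are added for illustration, and Math.floorMod is substituted for Math.abs(...) % length so the index is never negative.

```java
// A sketch of an open-address hash table assembled from the slide fragments.
class OpenAddressTable {
    private int numItems;
    private final Object[] keys;
    private final Object[] data;
    private final boolean[] hasBeenUsed;

    OpenAddressTable(int capacity) {   // capacity must be positive
        keys = new Object[capacity];
        data = new Object[capacity];
        hasBeenUsed = new boolean[capacity];
    }

    private int hash(Object key) {
        // Substitution (assumption): floorMod keeps the index in 0 .. length-1.
        return Math.floorMod(key.hashCode(), data.length);
    }

    // Assumed: linear probing, moving to the next cell and wrapping at the end.
    private int nextIndex(int i) {
        return (i + 1) % data.length;
    }

    // Returns the index holding key, or -1 if the key is not in the table.
    private int findIndex(Object key) {
        int count = 0;
        int i = hash(key);
        while (count < data.length && hasBeenUsed[i]) {
            if (key.equals(keys[i])) return i;
            i = nextIndex(i);
            count++;
        }
        return -1;
    }

    // Illustrative lookup, not shown in the slides.
    public Object get(Object key) {
        int index = findIndex(key);
        return index == -1 ? null : data[index];
    }

    // Stores element under key; returns the element previously stored, or null.
    public Object put(Object key, Object element) {
        int index = findIndex(key);
        if (index != -1) {
            Object answer = data[index];
            data[index] = element;
            return answer;
        } else if (numItems < data.length) {
            index = hash(key);
            while (keys[index] != null) index = nextIndex(index);
            keys[index] = key;
            data[index] = element;
            hasBeenUsed[index] = true;
            numItems++;
            return null;
        } else {
            throw new IllegalStateException("Table full");
        }
    }

    public static void main(String[] args) {
        OpenAddressTable t = new OpenAddressTable(3);
        t.put(0, "a");
        t.put(3, "b");                     // collides with key 0, lands in the next cell
        System.out.println(t.get(3));      // prints b
    }
}
```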

37
Two Hashes are Better than One
  • Collisions can result in long stretches of
    positions with keys not in their preferred
    position
  • This is called clustering
  • To address this problem, when a collision results
    we jump a key-dependent number of positions, using a
    second hash function

38
Double Hashing
  • Find the first position using hash1(key)
  • If there's a collision, step through the array in
    steps of size hash2(key)
  • i = (i + hash2(key)) % data.length
  • To avoid cycles, hash2(key) and the length of the
    array must be relatively prime (no common factors)

39
Double Hashing
  • Knuth's technique to avoid cycles
      • Choose the length of the array so that both
        data.length and data.length-2 are prime
  • hash1(key)
      • Math.abs(key.hashCode()) % length
  • hash2(key)
      • 1 + (Math.abs(key.hashCode()) % (length-1))
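
To see why a step size that is relatively prime to the array length avoids cycles, the probe sequence can be listed for one key. The sketch below uses the two formulas as they appear in this transcript (with the missing + and % restored, which is an assumption), a table length of 13 (13 and 11 are both prime), and illustrative names.

```java
import java.util.Arrays;

// Lists the positions visited by double hashing for a single key.
class DoubleHashingProbe {
    static int hash1(Object key, int length) {
        return Math.abs(key.hashCode()) % length;
    }

    static int hash2(Object key, int length) {
        return 1 + (Math.abs(key.hashCode()) % (length - 1));
    }

    /** Returns the first `length` positions probed for key. */
    static int[] probeSequence(Object key, int length) {
        int[] seq = new int[length];
        int i = hash1(key, length);
        int step = hash2(key, length);
        for (int n = 0; n < length; n++) {
            seq[n] = i;
            i = (i + step) % length;
        }
        return seq;
    }

    public static void main(String[] args) {
        // Key 25: hash1 = 25 % 13 = 12, hash2 = 1 + 25 % 12 = 2.
        System.out.println(Arrays.toString(probeSequence(25, 13)));
        // prints [12, 1, 3, 5, 7, 9, 11, 0, 2, 4, 6, 8, 10]
    }
}
```

Every one of the 13 positions appears exactly once, so an empty cell is always reached if one exists.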

40
Issues with O-A Hashing
  • Each array cell holds only one element
  • Collisions and clustering can degrade performance
  • Once the array is full, no more elements can be
    added, unless we
      • create a new array with the right size and hash
        functions
      • re-hash the original elements

41
Chained Hashing
  • Each array cell can hold more than one element of
    the hash table
  • Hash the key of each element to obtain the array
    index
  • When a collision happens, the element is still
    placed at the original hash index
  • How is this handled?

42
Answer
  • Each array location must be implemented with a
    data structure that can hold a group of elements
    with the same hash index
  • Most common approach
      • each array location stores the head of a linked
        list
      • items in the list all have the same hash index

43
Chained Hashing
  • [Figure: an array named table with cells 0, 1, 2, 3]
  • Any number of elements can be added to the table
    without a need to rehash
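
A chained table along these lines can be sketched in a few dozen lines. Everything here is illustrative rather than taken from the slides: the names ChainedTable and Node, the put/get interface (modelled on the open-address put shown earlier), and the use of Math.floorMod to compute the index.

```java
// A sketch of chained hashing: each array cell holds the head of a linked list.
class ChainedTable {
    private static class Node {
        final Object key;
        Object element;
        Node next;

        Node(Object key, Object element, Node next) {
            this.key = key;
            this.element = element;
            this.next = next;
        }
    }

    private final Node[] table;

    ChainedTable(int size) {           // size must be positive
        table = new Node[size];
    }

    private int hash(Object key) {
        return Math.floorMod(key.hashCode(), table.length);
    }

    // Stores element under key; returns the element previously stored, or null.
    public Object put(Object key, Object element) {
        int i = hash(key);
        for (Node n = table[i]; n != null; n = n.next) {
            if (key.equals(n.key)) {
                Object answer = n.element;
                n.element = element;
                return answer;
            }
        }
        table[i] = new Node(key, element, table[i]);   // new head of the chain
        return null;
    }

    public Object get(Object key) {
        for (Node n = table[hash(key)]; n != null; n = n.next) {
            if (key.equals(n.key)) return n.element;
        }
        return null;
    }

    public static void main(String[] args) {
        ChainedTable t = new ChainedTable(4);
        for (int i = 0; i < 100; i++) t.put(i, "v" + i);   // far more items than cells
        System.out.println(t.get(42));                      // prints v42
    }
}
```

The table never fills up, but with 100 items in 4 cells each chain holds about 25 nodes, so lookups slow down as the chains grow.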