Data Structures and Algorithms for Information Processing presentation

1
Data Structures and Algorithms for Information
Processing

Lecture 9: Searching

2
Outline

The simplest method: serial search
Binary search
Open-address hashing
Chained hashing

3
Search Algorithms

Whenever large amounts of data need to be
accessed quickly, search algorithms are crucially
involved.

4
Search Algorithms

Lie at the heart of many computer
technologies. To name a few:
Databases
Information retrieval applications
Web infrastructure (file systems, domain name
servers, etc.)

5
Search Algorithms: Two Broad Categories

Searching a static database (e.g., accessing
indexed Web pages, finding a file on disk)
Evaluating a dynamically changing set of
hypotheses (e.g., computer chess, speech
recognition)
We'll be concerned with the first

6
The Simplest Search: Serial Lookup

Items stored in an array or list
To search for an item x:
Start at the beginning of the list
Compare the current item to x
If equal, stop; if unequal, proceed to the next item

7
Pseudocode for Serial Search

// Find x in an array a of length n
int i = 0;
boolean found = false;
while ((i < n) && !found) {
    if (a[i] == x)
        found = true;
    else i++;
}
if (found) ...

8
Analysis for Serial Search

Best case: requires one array access
Worst case: requires n array accesses
Average case: to access an item, assuming its
position is random (uniform), requires
(1 + 2 + 3 + ... + n)/n = n(n+1)/(2n)
= (n+1)/2 accesses,
which is O(n)
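
The average-case figure can be checked by counting accesses directly. The following is a minimal sketch, assuming the array holds the distinct values 0 to n-1 and every present item is an equally likely target; the class and method names are illustrative, not from the lecture.

```java
class SerialSearchCost {
    // Number of array accesses serial search makes while looking for x.
    // If x is absent, every one of the a.length cells is examined.
    static int accesses(int[] a, int x) {
        int i = 0;
        int count = 0;
        boolean found = false;
        while (i < a.length && !found) {
            count++;                 // one access to a[i]
            if (a[i] == x) found = true;
            else i++;
        }
        return count;
    }

    // Average number of accesses over all n present targets.
    static double averageAccesses(int n) {
        int[] a = new int[n];
        for (int i = 0; i < n; i++) a[i] = i;
        long total = 0;
        for (int x = 0; x < n; x++) total += accesses(a, x);
        return (double) total / n;
    }

    public static void main(String[] args) {
        System.out.println(averageAccesses(1000)); // 500.5, i.e. (n+1)/2
    }
}
```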

9
A Useful Combinatorial Identity
1 + 2 + 3 + ... + n = n(n+1)/2. Why? (derived on
p. 22 of Main)

10
Visual Counting

n * n

11
Visual Counting

n

12
Visual Counting

n * n - n

13
Visual Counting

(n * n - n)/2 + n = n(n+1)/2

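
The four counting steps can be written out as one derivation. This is a sketch under an assumption, since the figures are not in the transcript: the picture is taken to be an n-by-n grid of cells in which the diagonal and one of the two off-diagonal triangles together hold 1 + 2 + ... + n cells.

```latex
\begin{align*}
\text{cells in the grid} &= n \cdot n \\
\text{cells on the diagonal} &= n \\
\text{cells off the diagonal} &= n \cdot n - n \\
1 + 2 + \cdots + n &= \frac{n \cdot n - n}{2} + n
                    = \frac{n^2 + n}{2}
                    = \frac{n(n+1)}{2}
\end{align*}
```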
14
Binary Search

Can be used whenever the data are totally ordered
-- e.g., the integers
Requires sorting in advance, and storing in an
array
One of the simplest to implement, often fast
enough
Can be tricky to handle boundary cases

15
Idea of Binary Search

Closely related to the natural algorithm we use
to look up a word in a dictionary
Open to the middle
If target comes before all words on the page,
search in left half of book
Otherwise, search in right half.

16
Interface for Binary Search

int search(int[] a, int first, int size, int
target)
Parameters
int[] a: array to be searched over
Search over a[first], a[first+1], ..., a[first+size-1]
Invariants
array is sorted in increasing order
first >= 0

17
Implementation

int search(int[] a, int start, int size, int target) {
    if (size <= 0) return -1;
    else {
        int middle = start + size/2;
        if (a[middle] == target) return middle;
        else if (target < a[middle])
            return search(a, start, size/2, target);
        else
            return search(a, middle+1, size/2, target);
    }
}

18
Implementation

Where's the error?
Suppose size is even. Are the
new sizes correct?
Suppose size is odd. Are the
new sizes correct?

19
Implementation

int search(int[] a, int first, int size, int target) {
    if (size <= 0) return -1;
    else {
        int middle = first + size/2;
        if (a[middle] == target) return middle;
        else if (target < a[middle])
            return search(a, first, size/2, target);
        else
            return search(a, middle+1, (size-1)/2, target);
    }
}
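
A self-contained version of this corrected search may be useful for experimenting. The sketch below wraps it in a class with a small driver; the class name and sample data are illustrative only.

```java
class BinarySearchDemo {
    // Searches a[first], ..., a[first+size-1], assumed sorted in increasing order.
    // Returns an index holding target, or -1 if target is absent.
    static int search(int[] a, int first, int size, int target) {
        if (size <= 0) return -1;
        int middle = first + size / 2;
        if (a[middle] == target) return middle;
        if (target < a[middle]) {
            // Left part: the size/2 cells before middle.
            return search(a, first, size / 2, target);
        }
        // Right part: the (size-1)/2 cells after middle.
        return search(a, middle + 1, (size - 1) / 2, target);
    }

    public static void main(String[] args) {
        int[] a = {2, 3, 5, 7, 11, 13};
        System.out.println(search(a, 0, a.length, 11)); // 4
        System.out.println(search(a, 0, a.length, 4));  // -1
    }
}
```

The two sizes add up as they should: size/2 cells on the left, one cell at middle, and (size-1)/2 cells on the right total size cells for both even and odd size.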

20
Boundary Cases

Binary search is sometimes tricky to get right.
A common source of bugs.
Test cases are not always helpful for checking
correctness of code.
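
One way to see why test cases can miss a boundary bug is to run a faulty search on inputs that happen never to reach the bad case. The sketch below is illustrative: buggySearch passes size/2 for the right half (one cell too many whenever size is even). On an array of length 7 every subrange has odd size, so all lookups succeed, while an array of length 4 runs past the end.

```java
class BoundaryDemo {
    // Faulty on purpose: the right half should get (size-1)/2 cells, not size/2.
    static int buggySearch(int[] a, int first, int size, int target) {
        if (size <= 0) return -1;
        int middle = first + size / 2;
        if (a[middle] == target) return middle;
        if (target < a[middle]) return buggySearch(a, first, size / 2, target);
        return buggySearch(a, middle + 1, size / 2, target);
    }

    public static void main(String[] args) {
        // Length 7 = 2^3 - 1: subranges have sizes 7, 3, 1, so the bug never shows.
        int[] seven = {1, 3, 5, 7, 9, 11, 13};
        boolean allFound = true;
        for (int i = 0; i < seven.length; i++) {
            if (buggySearch(seven, 0, seven.length, seven[i]) != i) allFound = false;
        }
        System.out.println("length 7, all targets found: " + allFound); // true

        // Length 4: searching for the last element reads a[4].
        int[] four = {1, 3, 5, 7};
        try {
            buggySearch(four, 0, four.length, 7);
            System.out.println("length 4: no exception");
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("length 4: index out of bounds");
        }
    }
}
```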

21
Binary Search with Other Data Structures

Can binary search be implemented using linked
lists rather than arrays?
Are there any other data structures that could be
used?

22
Analysis of Binary Search

Recursively dividing the array in half represents
the data as a full binary tree.
Simplest case -- array of size n = 2^k - 1,
complete binary tree.
Take away one and divide by 2.
New size = 2^(k-1) - 1.
Thus, worst case involves O(log n) operations

23
Average Case

A complete binary tree with k leaves has k-1
internal nodes.
So, about half of the n data elements require
O(log n) operations to find.
Thus, assuming uniform distribution on target
elements, average cost is also O(log n).
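
Both the worst-case and the average-case claims can be checked by counting how many array elements the search examines. The sketch below assumes an array of size n = 2^k - 1 holding the values 0 to n-1, with every present value equally likely as the target; the names are illustrative.

```java
class BinarySearchCost {
    // Number of elements examined while searching a[first..first+size-1] for target.
    static int probes(int[] a, int first, int size, int target) {
        if (size <= 0) return 0;
        int middle = first + size / 2;
        if (a[middle] == target) return 1;
        if (target < a[middle]) return 1 + probes(a, first, size / 2, target);
        return 1 + probes(a, middle + 1, (size - 1) / 2, target);
    }

    // Builds the sorted array 0, 1, ..., n-1.
    static int[] sorted(int n) {
        int[] a = new int[n];
        for (int i = 0; i < n; i++) a[i] = i;
        return a;
    }

    // Largest probe count over all present targets.
    static int worstCase(int n) {
        int[] a = sorted(n);
        int worst = 0;
        for (int t = 0; t < n; t++) worst = Math.max(worst, probes(a, 0, n, t));
        return worst;
    }

    // Mean probe count over all present targets.
    static double average(int n) {
        int[] a = sorted(n);
        long total = 0;
        for (int t = 0; t < n; t++) total += probes(a, 0, n, t);
        return (double) total / n;
    }

    public static void main(String[] args) {
        int k = 10;
        int n = (1 << k) - 1;             // 1023
        System.out.println(worstCase(n)); // 10, i.e. k
        System.out.println(average(n));   // about 9.01, close to k
    }
}
```

With n = 1023, the 512 elements at the bottom level each take 10 probes, which is why the average sits so close to the worst case.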

24
Binary Search is Limited

When we have a large number of items that will
be accessed in part of the program where
efficiency is crucial, binary search may be too
slow

29
Hashing

Fortunately, we can often do better
Hashing is a technique where the access time
can be O(1) rather than O(log n)

30
Open Address Hashing

The basic technique
Items are stored in an array of size N
The preferred position in the array is computed
using a hash function of the item's key
When adding an item, if the preferred position is
occupied, the next open position in the array is
used instead.

31
Open Address Hashing

Main's presentation for Chapter 11

32
A Basic Hash Table

We keep arrays for the keys and data, and a bit
indicating whether a given position has been
occupied
private class Table {
    private int numItems;
    private Object[] keys;
    private Object[] data;
    private boolean[] hasBeenUsed;
    ....
}

33
The Hash Function

We can use the built-in hash function that Java
provides
private int hash(Object key) {
    return Math.abs(key.hashCode()) % data.length;
}
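
One edge case of this formula is worth knowing: Math.abs(Integer.MIN_VALUE) is still Integer.MIN_VALUE, so a key with that hash code yields a negative index. The sketch below compares the slide's formula with a variant based on Math.floorMod; the variant, the explicit length parameter, and the class name are assumptions for illustration, not part of the lecture.

```java
class HashIndexDemo {
    // The slide's formula, with the table length passed in explicitly.
    static int absHash(Object key, int length) {
        return Math.abs(key.hashCode()) % length;
    }

    // Assumed alternative: floorMod always returns a value in 0..length-1
    // when length is positive.
    static int floorModHash(Object key, int length) {
        return Math.floorMod(key.hashCode(), length);
    }

    public static void main(String[] args) {
        Integer key = Integer.MIN_VALUE;           // its hashCode() is the int value itself
        System.out.println(absHash(key, 11));      // -2
        System.out.println(floorModHash(key, 11)); // 9
    }
}
```

For keys with a non-negative hash code the two methods return the same index; they can differ for negative hash codes.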

34
Calculating the Index

// If found, return value is index of key
private int findIndex(Object key) {
    int count = 0;
    int i = hash(key);
    while ((count < data.length) && (hasBeenUsed[i])) {
        if (key.equals(keys[i])) return i;
        i = nextIndex(i);
        count++;
    }
    return -1;
}
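
The slides call nextIndex without showing it. A minimal sketch follows, assuming plain linear probing with wrap-around, which matches the "next open position" description of open addressing; taking the table length as a parameter is a simplification for this standalone example.

```java
class NextIndexDemo {
    // Hypothetical helper: step to the following cell, wrapping to 0 at the end.
    static int nextIndex(int i, int length) {
        return (i + 1 == length) ? 0 : i + 1;
    }

    public static void main(String[] args) {
        // Probe order starting at cell 3 of a 5-cell table.
        int i = 3;
        for (int n = 0; n < 5; n++) {
            System.out.print(i + " ");
            i = nextIndex(i, 5);
        }
        System.out.println(); // prints: 3 4 0 1 2
    }
}
```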

35
Inserting an Item

public Object put(Object key, Object element) {
    int index = findIndex(key);
    if (index != -1) {
        Object answer = data[index];
        data[index] = element;
        return answer;
    } else if (numItems < data.length)
    ....

36
Inserting an Item

public Object put(Object key, Object element) {
    ...
    else if (numItems < data.length) {
        index = hash(key);
        while (keys[index] != null)
            index = nextIndex(index);
        keys[index] = key;
        data[index] = element;
        hasBeenUsed[index] = true;
        numItems++;
        return null;
    } else throw new IllegalStateException("Table full");
    ....
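
Putting the fragments together gives a small table that can be run. This is a sketch rather than the textbook class: the name MiniTable, the constructor, the get method, and the linear-probing nextIndex are assumptions added so the example stands alone.

```java
class MiniTable {
    private int numItems;
    private final Object[] keys;
    private final Object[] data;
    private final boolean[] hasBeenUsed;

    MiniTable(int capacity) {
        keys = new Object[capacity];
        data = new Object[capacity];
        hasBeenUsed = new boolean[capacity];
    }

    private int hash(Object key) {
        return Math.abs(key.hashCode()) % data.length;
    }

    // Assumed linear probing: move to the next cell, wrapping at the end.
    private int nextIndex(int i) {
        return (i + 1 == data.length) ? 0 : i + 1;
    }

    // Index of key, or -1 if the key is not in the table.
    private int findIndex(Object key) {
        int count = 0;
        int i = hash(key);
        while (count < data.length && hasBeenUsed[i]) {
            if (key.equals(keys[i])) return i;
            i = nextIndex(i);
            count++;
        }
        return -1;
    }

    // Stores element under key; returns the data it replaced, or null.
    Object put(Object key, Object element) {
        int index = findIndex(key);
        if (index != -1) {                  // key already present: replace its data
            Object answer = data[index];
            data[index] = element;
            return answer;
        } else if (numItems < data.length) {
            index = hash(key);              // start at the preferred position
            while (keys[index] != null) index = nextIndex(index);
            keys[index] = key;
            data[index] = element;
            hasBeenUsed[index] = true;
            numItems++;
            return null;
        } else {
            throw new IllegalStateException("Table full");
        }
    }

    // Hypothetical lookup built on findIndex.
    Object get(Object key) {
        int index = findIndex(key);
        return (index == -1) ? null : data[index];
    }

    public static void main(String[] args) {
        MiniTable t = new MiniTable(5);
        t.put(1, "one");
        t.put(6, "six");                     // 6 % 5 == 1, so it collides with key 1
        System.out.println(t.get(6));        // six
        System.out.println(t.put(1, "uno")); // one (the replaced data)
    }
}
```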

37
Two Hashes are Better than One

Collisions can result in long stretches of
positions with keys not in their preferred
position
This is called clustering
To address this problem, when a collision results
we jump a random number of positions, using a
second hash function

38
Double Hashing

Find the first position using hash1(key)
If there's a collision, step through the array in
steps of size hash2(key)
i = (i + hash2(key)) % data.length
To avoid cycles, hash2(key) and the length of the
array must be relatively prime (no common factors)
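
The relatively-prime requirement can be seen by counting how many distinct positions a probe sequence reaches. The sketch below is illustrative; the step sizes are passed in directly rather than computed by a hash function.

```java
import java.util.HashSet;
import java.util.Set;

class ProbeCycleDemo {
    // Counts the distinct positions reached from start when
    // i = (i + step) % length is applied length times.
    static int positionsVisited(int start, int step, int length) {
        Set<Integer> seen = new HashSet<>();
        int i = start;
        for (int n = 0; n < length; n++) {
            seen.add(i);
            i = (i + step) % length;
        }
        return seen.size();
    }

    public static void main(String[] args) {
        // 4 and 10 share the factor 2: only positions 0, 4, 8, 2, 6 are reached.
        System.out.println(positionsVisited(0, 4, 10)); // 5
        // 3 and 10 are relatively prime: every position is reached.
        System.out.println(positionsVisited(0, 3, 10)); // 10
    }
}
```

When the step and the length share a factor, part of the array can never be probed, so an insertion may fail to find an open cell even though one exists.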

39
Double Hashing

Knuth's technique to avoid cycles
Choose the length of the array so that both
data.length and data.length-2 are prime
hash1(key) =
Math.abs(key.hashCode()) % length
hash2(key) =
1 + (Math.abs(key.hashCode()) % (length-1))
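
The two hash functions can be tried out with a concrete length. The sketch below assumes a table length of 811 (811 and 809 are both prime) and uses the hash2 modulus exactly as it appears in this transcript; the class name and the probeSequence helper are illustrative.

```java
class DoubleHashDemo {
    static final int LENGTH = 811; // assumed table length; 811 and 809 are prime

    // Preferred position of the key.
    static int hash1(Object key) {
        return Math.abs(key.hashCode()) % LENGTH;
    }

    // Step size, always between 1 and LENGTH-1 (modulus as given in the transcript).
    static int hash2(Object key) {
        return 1 + (Math.abs(key.hashCode()) % (LENGTH - 1));
    }

    // The first count positions probed for key.
    static int[] probeSequence(Object key, int count) {
        int[] positions = new int[count];
        int i = hash1(key);
        int step = hash2(key);
        for (int n = 0; n < count; n++) {
            positions[n] = i;
            i = (i + step) % LENGTH;
        }
        return positions;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(probeSequence("apple", 5)));
    }
}
```

Because the length is prime, every step size from 1 to LENGTH-1 is relatively prime to it, so a probe sequence of LENGTH steps visits each cell exactly once.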

40
Issues with O-A Hashing

Each array cell holds only one element
Collisions and clustering can degrade performance
Once the array is full, no more elements can be
added unless we create a new array with the
right size and hash functions, and re-hash the
original elements into it

41
Chained Hashing

Each array cell can hold more than one element of
the hash table
Hash the key of each element to obtain the array
index
When a collision happens, the element is still
placed at the original hash index
How is this handled?

42
Answer

Each array location must be implemented with a
data structure that can hold a group of elements
with the same hash index
Most common approach
each array location stores the head of a linked
list
items in the list all have the same hash index

43
Chained Hashing
[Figure: an array labeled table with cells 0, 1, 2, 3]

Any number of elements can be added to the table
without a need to rehash
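
A chained table along these lines can be sketched in a few dozen lines. The version below is an illustration under assumptions, not the textbook class: each array cell holds the head of a hand-rolled singly linked list, and the names ChainedTable, Node, get, and size are hypothetical.

```java
class ChainedTable {
    // One link in a chain: a key, its data, and the next node with the same hash index.
    private static class Node {
        final Object key;
        Object data;
        Node next;

        Node(Object key, Object data, Node next) {
            this.key = key;
            this.data = data;
            this.next = next;
        }
    }

    private final Node[] table; // table[i] is the head of the chain for hash index i
    private int numItems;

    ChainedTable(int capacity) {
        table = new Node[capacity];
    }

    private int hash(Object key) {
        return Math.abs(key.hashCode()) % table.length;
    }

    // Stores element under key; returns the data it replaced, or null.
    Object put(Object key, Object element) {
        int index = hash(key);
        for (Node n = table[index]; n != null; n = n.next) {
            if (key.equals(n.key)) {
                Object answer = n.data;
                n.data = element;
                return answer;
            }
        }
        // Key not present: add a new node at the head of the chain.
        table[index] = new Node(key, element, table[index]);
        numItems++;
        return null;
    }

    Object get(Object key) {
        for (Node n = table[hash(key)]; n != null; n = n.next) {
            if (key.equals(n.key)) return n.data;
        }
        return null;
    }

    int size() {
        return numItems;
    }

    public static void main(String[] args) {
        ChainedTable t = new ChainedTable(4);
        for (int k = 0; k < 10; k++) t.put(k, "item" + k);
        System.out.println(t.size()); // 10 items in a 4-cell array
        System.out.println(t.get(9)); // item9
    }
}
```

The table never fills up, but lookups slow down as the chains grow, since finding a key means walking the list stored at its hash index.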
