Foundations of Software Design

Slides: 33

Provided by: coursesIs8

Learn more at: https://courses.ischool.berkeley.edu

1
Foundations of Software Design
Lecture 26: Text Processing, Tries, and Dynamic Programming
Marti Hearst, Fredrik Wallenberg
Fall 2002
2
Problem: String Search

Determine if, and where, a substring occurs
within a string

3
Approaches/Algorithms

Brute Force
Rabin-Karp
Tries
Dynamic Programming

4
Brute Force Algorithm
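
The slide's figure did not survive extraction. As a minimal sketch of the usual brute-force approach (the function name `brute_force_search` is an assumption, not from the slides), try every starting position in the text and compare the pattern character by character:

```python
def brute_force_search(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    for start in range(n - m + 1):
        # Compare the pattern against the text one character at a time.
        offset = 0
        while offset < m and text[start + offset] == pattern[offset]:
            offset += 1
        if offset == m:
            return start  # every character matched
    return -1  # no starting position worked
```

In this sketch the inner loop can run up to M times for each of roughly N starting positions, which is where the N*M worst case comes from.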
5
Worst-case Complexity
6
Best-case Complexity, String Found
7
Best-case Complexity, String Not Found
8
Rabin-Karp Algorithm

Calculate a hash value for:
The pattern being searched for (length M), and
Each M-character substring in the text
Start with the first M-character substring:
Hash it
Compare the hashed search term against it
If they match, then compare the letters directly
Why do we need this step?
Else go to the next M-character substring

(Note 1: Karp is a Turing Award-winning professor in CS here!)
(Note 2: CS theory is a good field to be in because they name things after you!)
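
The steps above can be sketched as follows. This is one possible version, not necessarily the one shown in lecture: it assumes a polynomial rolling hash, and the names `rabin_karp`, `base`, and `modulus` and their default values are illustrative choices.

```python
def rabin_karp(text: str, pattern: str, base: int = 256, modulus: int = 1_000_003) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    if m > n:
        return -1
    high = pow(base, m - 1, modulus)  # weight of the window's leading character
    pattern_hash = window_hash = 0
    for k in range(m):
        pattern_hash = (pattern_hash * base + ord(pattern[k])) % modulus
        window_hash = (window_hash * base + ord(text[k])) % modulus
    for start in range(n - m + 1):
        # Equal hashes are only a hint: confirm by comparing the letters.
        if window_hash == pattern_hash and text[start:start + m] == pattern:
            return start
        if start + m < n:
            # Roll the hash: drop text[start], append text[start + m].
            window_hash = (window_hash - ord(text[start]) * high) % modulus
            window_hash = (window_hash * base + ord(text[start + m])) % modulus
    return -1
```

Rolling the hash makes each step constant-time, so the hash of a window is never recomputed from scratch.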
9
Karp-Rabin: Looking for 31415

31415 mod 13 = 7
Thus compute each 5-char substring mod 13, looking for 7
[2 3 5 9 0] 2 3 1 4 1 5 2 6 7 3 9 9 2 1    -> 8
2 [3 5 9 0 2] 3 1 4 1 5 2 6 7 3 9 9 2 1    -> 9
2 3 [5 9 0 2 3] 1 4 1 5 2 6 7 3 9 9 2 1    -> 3
...
2 3 5 9 0 2 [3 1 4 1 5] 2 6 7 3 9 9 2 1    -> 7

Found 7! Now check the digits
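
To see why the digits still have to be checked, the short sketch below recomputes every 5-digit window of this text mod 13. The helper name `hash_hits` is hypothetical. For clarity it hashes each window from scratch instead of rolling the hash.

```python
def hash_hits(text: str, pattern: str, q: int):
    """List (start, window, is_real_match) for windows whose value mod q equals the pattern's."""
    m = len(pattern)
    target = int(pattern) % q
    hits = []
    for start in range(len(text) - m + 1):
        window = text[start:start + m]
        if int(window) % q == target:
            hits.append((start, window, window == pattern))
    return hits

hits = hash_hits("2359023141526739921", "31415", 13)
# hits == [(6, "31415", True), (12, "67399", False)]
```

The second hit, 67399, also gives 7 mod 13 but is not the pattern, so comparing hashes alone would report a false match.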
10
Rabin-Karp Algorithm

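Worst-case time?
N is the length of the string, M the length of the pattern
O(N) expected if the hash function is chosen well (few false matches)
O(NM) in the worst case, when many windows hash to the pattern's value and must be checked letter by letter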
http://orca.st.usm.edu/suzi/stringmatch/rk_alg.html
http://www.mills.edu/ACAD_INFO/MCS/CS/S00MCS125/String.Matching.Algorithms/animations.html

11
Tries

A tree-based data structure for storing strings in order to make pattern matching faster
Main idea:
Store all the strings from the document, one letter at a time, in a tree structure
Two strings with the same prefix are in the same subtree
Useful for IR prefix queries:
Search for the longest prefix of a query string Q that matches a prefix of some string in the trie
The name comes from the word "retrieval" (as in Information reTRIEval)

12
Trie Example

The standard trie over the alphabet {a, b} for the set {aabab, abaab, babbb, bbaaa, bbbab}

13
A Simple Incremental Algorithm

To build the trie, simply add one string at a time
Starting at the root, check to see if the current character matches a branch of the current node
If so, follow that branch and move to the next character
If not, make a new branch labeled with the mismatched character, follow it, and then move to the next character
Repeat until the string is used up

14
Trie-growing Algorithm
buy, bell, hear, see, bid, bear, stop, bull, sell, stock
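
A minimal sketch of the incremental algorithm applied to this word list. It assumes a trie stored as nested dictionaries, one dictionary per node keyed by character. The `END` marker and the name `build_trie` are illustrative choices, not from the slides.

```python
END = "$"  # hypothetical marker meaning "a stored word ends at this node"

def build_trie(words):
    """Build a trie as nested dicts by adding one word at a time."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            # Follow the branch for ch, creating it if it does not exist yet.
            node = node.setdefault(ch, {})
        node[END] = {}
    return root

words = ["buy", "bell", "hear", "see", "bid", "bear", "stop", "bull", "sell", "stock"]
trie = build_trie(words)
```

With these ten words the root ends up with three branches (b, h, s), and words that share a prefix, such as bell and bear, share the path for that prefix.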
15
Tries, more formally

The path from the root of T to any node represents a prefix that is equal to the concatenation of the characters encountered while traversing the path
An internal node can have from 1 to d children, where d is the size of the alphabet
The earlier example over {a, b} is a binary tree because the alphabet had only 2 letters
A path from the root of T to an internal node at depth i corresponds to an i-character prefix of a string S
The height of the tree is the length of the longest string
If there are s unique strings, none a prefix of another, T has s leaf nodes
Looking up a string of length M is O(M)
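
The O(M) lookup can be sketched as a walk that consumes one character of the query per step. This example assumes the same nested-dictionary representation and hypothetical `END` marker as a simple illustration. The names `contains` and `longest_prefix` are not from the slides.

```python
END = "$"  # hypothetical end-of-word marker

def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = {}
    return root

def contains(trie, word):
    """True if word was stored in the trie; takes one step per character."""
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return END in node

def longest_prefix(trie, query):
    """Longest prefix of query that is also a prefix of some stored string."""
    node = trie
    matched = 0
    for ch in query:
        if ch not in node:
            break
        node = node[ch]
        matched += 1
    return query[:matched]

trie = build_trie(["buy", "bell", "hear", "see", "bid", "bear", "stop", "bull", "sell", "stock"])
```

Neither function looks at more than M characters of the query, regardless of how many strings the trie holds.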

16
Compressed Tries

Compression is done after the trie has been built up; can't add more items afterwards.

17
Compressed Tries

Also known as PATRICIA Trie
Practical Algorithm To Retrieve Information Coded In Alphanumeric
D. Morrison, Journal of the ACM 15 (1968)
Improves a space inefficiency of Tries
Tries to remove nodes with only one child (pardon the pun)
The number of nodes is proportional to the number of strings, not to their total length
But this just makes the node labels longer
So this only helps if an auxiliary data structure is used to actually store the strings
The trie only stores triplets of numbers indicating where in the auxiliary data structure to look
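
A rough sketch of the compression step, assuming a trie stored as nested dictionaries with a hypothetical `END` marker. For readability it keeps the merged labels as strings on the edges. It does not implement the triplet-of-numbers scheme described above, and `compress` is an illustrative name.

```python
END = "$"  # hypothetical end-of-word marker

def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = {}
    return root

def compress(node):
    """Return a copy of the trie with chains of single-child nodes merged into one edge."""
    compressed = {}
    for label, child in node.items():
        # Extend the edge while the child has exactly one branch and no word ends there.
        while label != END and len(child) == 1 and END not in child:
            (next_label, next_child), = child.items()
            label += next_label
            child = next_child
        compressed[label] = compress(child)
    return compressed

words = ["buy", "bell", "hear", "see", "bid", "bear", "stop", "bull", "sell", "stock"]
compact = compress(build_trie(words))
```

For this word list the root of `compact` has the edges b, hear, and s, and the edge to under s leads to ck and p.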

18
Compressed Trie
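[Figure: compressed trie for the words of slide 14. Root edges: b, hear, s. Under b: e (ar, ll), id, u (ll, y). Under s: e (e, ll), to (ck, p).]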
19
Suffix Tries

Regular tries can only be used to find whole words (or their prefixes).
What if we want to search on suffixes?
build, mini
Solution: use suffix tries, where each possible suffix is stored in the trie
Example: minimize

Find: imi (follow the path i, m, i from the root)
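
A minimal sketch of the idea, assuming an uncompressed suffix trie stored as nested dictionaries. The function names are illustrative. Every suffix of the text is inserted, so any substring can be found by walking from the root.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a trie of nested dicts."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root

def contains_substring(trie, pattern):
    """True if pattern labels a path from the root, i.e. it is a prefix of some suffix."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True

suffix_trie = build_suffix_trie("minimize")
found = contains_substring(suffix_trie, "imi")  # True: follows i, m, i
```

This simple version uses space quadratic in the length of the text, which is one reason compressed representations are used in practice.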
20
Dynamic Programming

Used primarily for optimization problems
Not just a good solution, but an optimal one
Brute force algorithms:
Try every possibility
Guarantee finding the optimal solution
But inefficient
DP requires a certain amount of structure, namely:
Simple Subproblems (and simple break-down)
Global optimum is a composition of subproblem optima
Subproblem Overlap: optimal solutions to unrelated problems can contain subproblems in common
In other words, can re-use the results of solving the subproblem

21
Longest Common Subsequence

LCS: find the longest string S that is a subsequence of both X and Y, where
X is of length n
Y is of length m
Example: what is the LCS of
supergalactic
galaxy
(The characters do not have to be contiguous)

22
Dynamic Programming Applied to LCS Problem
Let's compare:
X = GTG = X[0..i]
Y = CGATG = Y[0..j]
We represent the length of the longest common subsequence of X[0..i] and Y[0..j] as L[i,j]
23
Dynamic Programming for LCS

Note that the length of the LCS of X and Y (L[i,j]) must be equal to the length of the LCS of
X[0..i-1] = GT (removing the last G) and
Y[0..j-1] = CGAT (removing the last G),
plus 1, since the matching Gs at X[i], Y[j] will increase the length by one.

24
Dynamic Programming for LCS

If X[i], Y[j] had NOT matched, L[i,j] would have to be equal to the larger of L[i-1,j] and L[i,j-1].
If this is true for L[i,j], it must be true for all L.
We know that L[-1,-1] = 0 (since both strings are empty)
Finally, we know that L[i,j] cannot be larger than min(i,j) + 1

25
Dynamic Programming for LCS
For each position, take the max of L[i-1,j] and L[i,j-1]. When a new match is found, add 1 to L[i-1,j-1] instead.
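
The table-filling rule can be sketched as below. This is one common way to write it, with assumptions made for the example: the table is shifted by one so that row 0 and column 0 stand for the empty prefixes (the slides' index -1), and the function name `lcs` and the traceback step are additions, not part of the slides.

```python
def lcs(x: str, y: str) -> str:
    """Return one longest common subsequence of x and y."""
    n, m = len(x), len(y)
    # table[i][j] = length of the LCS of x[:i] and y[:j]; row/column 0 = empty prefix.
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1  # match: extend the diagonal
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    # Walk back from the bottom-right corner to recover the characters.
    chars = []
    i, j = n, m
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            chars.append(x[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(chars))
```

For the two examples in these slides, `lcs("GTG", "CGATG")` gives `"GTG"` and `lcs("supergalactic", "galaxy")` gives `"gala"`.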
26
Dynamic Programming

Running Time/Space
Strings of length m and n
O(mn)
Brute force algorithm: 2^m subsequences of x to check against n elements of y, O(n 2^m)

27
Dynamic Programming vs. Greedy Algorithms

Sometimes they are the same
Sometimes not
What makes an algorithm greedy?
A globally optimal solution can be obtained by making locally optimal choices
Dynamic Programming:
Solves subproblems whose results can be re-used
Trickier to think of
More work to program

28
Greedy Vs. Dynamic Programming

The famous knapsack problem
A thief breaks into a museum. Fabulous
paintings, sculptures, and jewels are everywhere.
The thief has a good eye for the value of these
objects, and knows that each will fetch hundreds
or thousands of dollars on the clandestine art
collectors' market. But the thief has only
brought a single knapsack to the scene of the
robbery, and can take away only what he can
carry. What items should the thief take to
maximize the haul?

29
The Knapsack Problem

More formally, the 0-1 knapsack problem:
The thief must choose among n items, where the i-th item is worth v_i dollars and weighs w_i pounds
Carrying at most W pounds, the thief wants to maximize value
Note: assume v_i, w_i, and W are all integers
"0-1" because each item must be taken or left in its entirety
A variation, the fractional knapsack problem:
The thief can take fractions of items
Think of items in the 0-1 problem as gold ingots, in the fractional problem as buckets of gold dust

30
The Knapsack Problem: Optimal Substructure

Both variations exhibit optimal substructure
To show this for the 0-1 problem, consider the most valuable load weighing at most W pounds
If we remove item j from the load, what do we know about the remaining load?
The remainder must be the most valuable load weighing at most W - w_j that the thief could take from the museum, excluding item j
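
The optimal-substructure argument leads to a table over weight limits. The sketch below is one standard dynamic-programming formulation for the 0-1 problem, assuming integer weights as noted earlier. The function name `knapsack_01` and the sample numbers are hypothetical.

```python
def knapsack_01(values, weights, capacity):
    """Best total value using each item at most once, with total weight <= capacity."""
    # best[w] = most valuable load weighing at most w, using the items seen so far.
    best = [0] * (capacity + 1)
    for value, weight in zip(values, weights):
        # Go downwards so each item is counted at most once.
        for w in range(capacity, weight - 1, -1):
            # Either leave the item, or take it and fill the remaining w - weight optimally.
            best[w] = max(best[w], best[w - weight] + value)
    return best[capacity]

# Hypothetical items: values 3, 4, 5 with weights 2, 3, 4 and a limit of 5.
best_value = knapsack_01([3, 4, 5], [2, 3, 4], 5)  # 7: take the first two items
```

The running time is proportional to the number of items times W, which is why the integer assumption matters.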

31
Solving The Knapsack Problem

The optimal solution to the fractional knapsack problem can be found with a greedy algorithm
The optimal solution to the 0-1 problem cannot be found with the same greedy strategy
Greedy strategy: take items in decreasing order of dollars/pound
Example: 3 items weighing 10, 20, and 30 pounds; the knapsack can hold 50 pounds
Suppose item 2 is worth $100. Assign values to the other items so that the greedy strategy will fail
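
One assignment that makes the greedy strategy fail, offered as an illustration and not as the intended answer: item 1 worth 60 and item 3 worth 120. These two values and all function names below are assumptions. The sketch compares greedy on the 0-1 problem, an exhaustive search for the true 0-1 optimum, and greedy on the fractional problem.

```python
from itertools import combinations

def by_density(items):
    """Sort (value, weight) pairs by decreasing dollars per pound."""
    return sorted(items, key=lambda item: item[0] / item[1], reverse=True)

def greedy_01(items, capacity):
    """Take whole items in density order, skipping any that no longer fit."""
    total, remaining = 0, capacity
    for value, weight in by_density(items):
        if weight <= remaining:
            total += value
            remaining -= weight
    return total

def optimal_01(items, capacity):
    """Exhaustive search over all subsets; fine for a handful of items."""
    best = 0
    for size in range(len(items) + 1):
        for subset in combinations(items, size):
            if sum(weight for _, weight in subset) <= capacity:
                best = max(best, sum(value for value, _ in subset))
    return best

def greedy_fractional(items, capacity):
    """Take as much as fits of each item in density order."""
    total, remaining = 0.0, capacity
    for value, weight in by_density(items):
        take = min(weight, remaining)
        total += value * take / weight
        remaining -= take
    return total

items = [(60, 10), (100, 20), (120, 30)]  # (value, weight); 60 and 120 are assumed values
greedy_value = greedy_01(items, 50)           # 160: items 1 and 2, then item 3 does not fit
optimal_value = optimal_01(items, 50)         # 220: items 2 and 3
fractional_value = greedy_fractional(items, 50)  # 240.0: items 1 and 2 plus 2/3 of item 3
```

Greedy goes wrong in the 0-1 case because taking the densest item first leaves 20 pounds of capacity unused. In the fractional case that leftover space can always be filled.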

32
The Knapsack Problem Greedy Vs. Dynamic