Title: Applications of Exhaustive Search
1Lecture 4
- Applications of Exhaustive Search
- Jones Pevzner, Secs. 4.1-4.6
2Restriction Mapping Problem
- Restriction endonucleases cut DNA at particular
recognition sequences - Before the evolution of DNA sequencing methods,
restriction digests provided limited sequence
information about a sample of DNA. - By varying experimental conditions, you can
perform a partial or a complete digest - maximal
information is gained by performing a partial
digest.
3Formalizing the problem
0
2
4
10
7
End- point
End- point
Restriction sites
Multiset - formed of all pairwise distances
between sites
For the example above,
4Inverse problem
- Given the multiset, can we predict the
distribution of restriction sites along the
sample of DNA? - This maps to the Turnpike problem of Computer
Science - given the set of all distances between
exits on the turnpike, can you determine the
distribution of exits along the highway (i.e. the
roadmap)?
5Solutions are not always unique
- Add a set increment to all the site locations,
and you get the same digest - symbolizedAlso
the reflectionEquivalent restriction site
maps (in the sense of leading to the same partial
digest) are said to be homometric.
6A quick way to generate two homometric site sets
Given sets U and V, then
are guaranteed to be homometric
(see example on p. 87)
7 A brute-force (exhaustive) algorithm
Given a multiset L of m distances, and n sites
(where m n(n-1)/2 BRUTEFORCEPDP( L, n )M lt-
maximal element in L for (every set of n-2
integers, 0ltx2ltltxn-1ltM and xi in L) X lt-
0, x2,,xn-1,M Form DeltaX from X if(
DeltaX L ) return X output no solution
8Time complexity of BRUTEFORCEPDP( L, n )
Assume that every pass through the loop takes
about the same time Then the exec time scales as
According to Jones Pevzner, this goes as
O(n2n-4). (I have not checked that!) Is this a
tractable algorithm?
9A practical algorithm for PDP
Due to Skiena (1990). Work through the example
from text Given the multiset L
2,2,3,3,4,5,6,7,8,10, find the vector of sites
X. First, the width of the fragment is clearly
10 WIDTH 10. X (0, 10). L 2,2,3,3,4,5,6,7,8
What is the next biggest fragment? Its 8. The
next site must be at 8 or 2 (WIDTH - 8). At this
stage the problem is still symmetric, so lets
choose X(0,2,10). My new site adds fragments 2,
8 - remove them. L 2,3,3,4,5,6,7
10Skiena PDP example (cont)
Where we are X(0,2,10). L 2,3,3,4,5,6,7 The
next biggest fragment is 7. Possible positions
for the next site are 7, and WIDTH - 7 3. Try
3 New fragments generated against my current set
of sites will include one of size 1 thats not
in L, so I reject this choice. Try 7 New
fragments generated are size 7, 5, 3. This choice
is OK X (0, 2, 7, 10), L 2,3,4,6 Next
biggest fragment is 6 possible positions are 6,
and WIDTH - 6 4. Can we immediately reject the
former?
11Skiena PDP example (cont)
Again, we currently have X (0, 2, 7, 10), L
2,3,4,6 Try site at position 4 New fragments
generated are of sizes 4, 2, 3, 6. All of these
are in L, so now we have X (0, 2, 4, 7, 10), L
L is empty, so were finished!
12Pseudocode for the Skiena algorithm
This is implemented as a recursive algorithm.Why
might you expect that??
PARTIALDIGEST(L) WIDTH lt- max element in
L DELETE(WIDTH,L) X lt- (0, WIDTH) PLACE(L, X)
PLACE on next page
13PLACE() subroutine
PLACE(L,X) if L empty output X return y lt-
max element in L if DELTA(y,X) contained in
L ADD(y,X) DELETE(DELTA(y,X),
L) PLACE(L,X) DELETE(y,X) ADD(DELTA(y,X),
L) if DELTA(WIDTH-y,X) contained in
L ADD(WIDTH-y,X) DELETE(DELTA(WIDTH-y,X),L)
PLACE(L,X)return
14Time complexity for Skiena
Complex to analyze - depends on how often
alternatives have to be considered. Best case -
quadratic Worst case - exponential
15Finding motifs
- Like the detective in the Edgar Allan Poe story,
we can imagine be confronted with a text that has
been encoded using a strange character set.
Following the detective, we might expect that
combinations of characters that occur more
frequently than expected by chance might be
meaningful. - The simplest approach is to simply keep count of
all words of some selected size L that appear in
the text. We call these - words, or
- L-mers, or
- L-tuples
16Example - counting words
SEQUENCE ACTGTCAAAAGTCATCAAAATCG ACTG - 1 CTGT -
1 TGTC - 1 GTCA - 2 TCAA - 2 CAAA - 2 AAAA -
2 AAAG - 1 AAGT - 1 AGTC - 1 TCAT - 1 CATC -
1 ATCA - 1 AAAT - 1 AATC - 1 ATCG - 1 How did I
generate this? How would you do this in a C
program?
17Its not that easy!
- Motifs in DNA sequences are usually NOT strictly
conserved! You cant just count words. This is
similar to the Edgar Allan Poe story, except the
pirate was drunk when he encoded the message, and
didnt spell very well to begin with. - Consider the table of NF-kappaB binding sites
(Table 4.2)
18How to define motifs/profiles
Best done via an alignment. We imagine a set of
aligned sequences from which we can extract a
window that is L nucleotides wide. If there are
t sequences, then we have a profile matrix that
has t rows and L columns. We can count how many
As, Ts, Gs and Cs appear in each column, which
provides a measure of the variability along the
profile. It also lets us define a consensus
sequence.
19NF-kappaB Example
etc T T T G
G G A G T C C C T C G G T G A T
T T T C T A G G G G A A G A C C
---------------------------------- A2 3 1 0
0 1 16 3 1 2 4 1 T12 3 1 0 4 1 0 9
15 11 5 6 G1 0 16 18 14 16 1 1 1 0 0
0 C3 12 0 0 0 0 1 5 1 5 9 11
CONSENSUS T C G G G G A T T T C C
20A way to visualize consensus sequence - sequence
logo
T. D. Schneider and R. M. Stephens, Sequence
Logos A New Way to Display Consensus Sequences
Nucl. Acids Res. 18 6097-6100, 1990.
21How do we make the alignments?
- We dont know the motifs in advance so, we need
to construct ALL alignments of length L. - Assume we have t sequences to align We can
construct a vector S of t integers that specifies
a starting position in each sequence for the
L-mer to be extracted from that sequence. The
L-mers are then assembled to make a profile
matrix. - By generating all possible vectors S, we make all
possible profiles of width L.
22Now weve made all the profiles how do we rank
them?
- Interesting profiles depart from uniformity
somewhere along their length if the consensus is
weak, the profile might represent random
variation. - One way to quantitate the strength of a profile
is via a profile score.
23Computing a Profile Score
- Let P(S) be the profile matrix corresponding to
positions vector S, and MP(s)(j) be the maximum
character count in the jth column of the profile.
Then The maximum score possible is tL. The
worst is tL/4 (equal mix of characters at each
position in the profile.
24Motif-Finding Problem
- Given a set of DNA sequences, find a set of
L-mers, one from each sequence, that maximimzes
the consensus score. - Clearly, this can be done by exhaustive search!
25Another angle on the problem
- Compute the Hamming distance dH between two
strings of equal length as the number of
positions that differ between the two
strings dH(ATTGTC,ACTCTC) 2 - Given a profile defined by S, we define the total
Hamming profile distance between an L-mer v and
the aligned profilewhere si is the ith
sequence in the profile.
26Total Hamming Distance
- For a given L-mer v, and a set of t sequences,
the Total Hamming Distance is the minimum Hamming
Profile distance for v, taken over all possible
profiles (choices for the vector S).
Symbolically,
27The Median String Problem
- Given a set of DNA sequences, find a string v of
length L which has minimum Total Hamming
Distance. This is the median string. - Presents a double minimization problem -
- For a given string v, find the motif with minimal
Hamming Profile Distance - Minimize the minimal Hamming Profile Distance
over all choices for v.
28The Motif-Finding and Median String Problems are
Equivalent!
First, we observe that the consensus string for a
given profile IS ALSO the string with minimum
Hamming Profile distance.Second, we observe
that the Hamming profile distance and the profile
score are linked. If w is the consensus sequence
and v is an arbitrary L-mer, So, maximizing
the profiles score is the same as minimizing the
Total Hamming Distance!
29Control Flow,Functions Program Structures
- Kernighan Ritchie
- Chapts. 3 4
30Control elements
- Statement blocks
- Curly braces are used to group statements into
units they become syntactically equivalent to a
single statement. - Blocks are used with if-else, while, for and
other control statements -
31if-else constructs
- Please follow the advice of the text and ALWAYS
enclose clauses in curly braces neglecting to do
this is a frequent source of hard-to-find errors.
32switch statement
- This statement is preferred for selecting from a
large number of options. - Be sure to understand how execution falls
through after a matching case block is executed,
and the necessity of using break to escape this
behavior.
33for and while loops
- Note that these constructs are very flexible
often a matter of personal preference which to
use. Be sure to understand - The three fields of the for statement
- The roles of break and continue
- Possibilities of using commas to separate
statements (NOT recommended in general!)
34goto
- In general, dont.
- That said, in some situations you need to do
something silly to get around it. It should not
be thought of as forbidden, just not recommended.
35Functions
- Functions allow programs to be broken up into
functional pieces they may take arguments and
may return values. - Arguments must be supplied in a specified order.
- The prototype of a function declares the list of
expected arguments and also the return type of
the function. - If a prototype is declared outside of any
function in a file, it is globally available to
all the functions that appear after it. - Prototypes for library functions are declared in
header files, like stdio.h.