Title: Comp' Genomics
1Comp. Genomics
- Recitation 2
- 12/3/09
- Slides by Igor Ulitsky
2Outline
- Alignment re-cap
- End-space free alignment
- Affine gap alignment algorithm and proof
- Bounded gap/spaces alignments
3Dynamic programming
- Useful in many string-related settings
- Will be repeatedly used in the course
- General idea
- Confine the exponential number of possibilities
into some hierarchy, such that the number of
cases becomes polynomial
4Dynamic programming for shortest paths
- Finding the shortest path from X to Y using Floyd-Warshall
- Idea: if we know the shortest paths using intermediate vertices 1, ..., k-1, computing shortest paths using 1, ..., k is easy
- $d_{ij}^{(k)} = w_{ij}$ if $k = 0$
- $d_{ij}^{(k)} = \min\left(d_{ij}^{(k-1)},\ d_{ik}^{(k-1)} + d_{kj}^{(k-1)}\right)$ otherwise
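The recurrence above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `floyd_warshall` and the small example graph are assumptions, not part of the slides, and the matrix is updated in place, layer by layer.

```python
import math

def floyd_warshall(w):
    """All-pairs shortest path lengths.

    w is an n x n matrix of edge weights (hypothetical input format):
    math.inf where there is no edge, 0 on the diagonal. This is d^(0).
    """
    n = len(w)
    d = [row[:] for row in w]          # start from d^(0) = w
    for k in range(n):                 # now allow vertex k as an intermediate
        for i in range(n):
            for j in range(n):
                via_k = d[i][k] + d[k][j]
                if via_k < d[i][j]:
                    d[i][j] = via_k
    return d

INF = math.inf
# Illustrative directed graph: 0 -> 1 (4), 1 -> 2 (1), 2 -> 0 (2)
weights = [
    [0,   4,   INF],
    [INF, 0,   1],
    [2,   INF, 0],
]
dist = floyd_warshall(weights)
```

The three nested loops make the polynomial bound explicit: O(n^3) time for n vertices, instead of enumerating all paths.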
5Alignment reminder
[Figure: the possible last columns of an alignment: a character against a character, or a character against a space (-) in either string]
6Global alignment
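- Input: S1, S2
- Output: optimal alignment (minimum cost, i.e. maximum score)
- V(k,l): score of aligning S1[1..k] with S2[1..l]
- Base conditions:
- $V(i,0) = \sum_{k=1}^{i} \sigma(s_k, -)$
- $V(0,j) = \sum_{k=1}^{j} \sigma(-, t_k)$
- Recurrence relation, for all $1 \le i \le n$, $1 \le j \le m$:
- $V(i,j) = \max\{\, V(i-1,j-1) + \sigma(s_i,t_j),\ V(i-1,j) + \sigma(s_i,-),\ V(i,j-1) + \sigma(-,t_j) \,\}$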
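A minimal Python sketch of this recurrence follows. The scoring function `simple_sigma` and its values (match +1, mismatch -1, space -2) are assumptions chosen for illustration; the slides leave the scoring function open.

```python
def simple_sigma(a, b):
    # Hypothetical scores: match +1, mismatch -1, space -2.
    if a == "-" or b == "-":
        return -2
    return 1 if a == b else -1

def global_alignment_score(s, t, sigma=simple_sigma):
    """Fill the (n+1) x (m+1) table V and return V(n, m)."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    # Base conditions: a prefix aligned entirely against spaces.
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + sigma(s[i - 1], "-")
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + sigma("-", t[j - 1])
    # Recurrence: the last column is (s_i, t_j), (s_i, -) or (-, t_j).
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(
                V[i - 1][j - 1] + sigma(s[i - 1], t[j - 1]),
                V[i - 1][j] + sigma(s[i - 1], "-"),
                V[i][j - 1] + sigma("-", t[j - 1]),
            )
    return V[n][m]
```

With these assumed scores, aligning ACGT with AGT needs one space and keeps three matches, so the score is 3 - 2 = 1.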
7Alignment reminder
- Global alignment
- All of S1 has to be aligned with all of S2
- Every gap is paid for
- Solution equals V(n,m)
[Figure: DP matrix with the labels "Alignment score here" at cell V(n,m) and "Traceback all the way" along the path]
8Local alignment
- Local alignment
- A substring of S1 is aligned with a substring of S2
- Gaps outside the substrings are costless
- Solution equals the maximum-score cell in the DP matrix
- Base conditions:
- $V(i,0) = 0$
- $V(0,j) = 0$
- Recurrence relation, for all $1 \le i \le n$, $1 \le j \le m$:
- $V(i,j) = \max\{\, V(i-1,j-1) + \sigma(s_i,t_j),\ V(i-1,j) + \sigma(s_i,-),\ V(i,j-1) + \sigma(-,t_j),\ 0 \,\}$
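The local variant differs from the global one only in the zero base conditions, the extra 0 in the max, and where the answer is read. A hedged sketch, again with an assumed scoring function `simple_sigma` (match +1, mismatch -1, space -2):

```python
def simple_sigma(a, b):
    # Hypothetical scores: match +1, mismatch -1, space -2.
    if a == "-" or b == "-":
        return -2
    return 1 if a == b else -1

def local_alignment_score(s, t, sigma=simple_sigma):
    """Return the maximum cell of the local-alignment table."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]   # V(i,0) = V(0,j) = 0
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(
                0,                                        # restart the alignment here
                V[i - 1][j - 1] + sigma(s[i - 1], t[j - 1]),
                V[i - 1][j] + sigma(s[i - 1], "-"),
                V[i][j - 1] + sigma("-", t[j - 1]),
            )
            best = max(best, V[i][j])             # answer is the best cell anywhere
    return best
```

For example, `local_alignment_score("xxACGTyy", "zzACGTww")` picks out the shared ACGT and ignores the flanking characters.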
9Ends-free alignment
- Something between global and local
- Consider aligning a gene to a (bacterial) genome
- Gaps at the beginning and end of S and T are costless
- But all of S and T should be aligned
- Base conditions:
- $V(i,0) = 0$
- $V(0,j) = 0$
- Recurrence relation, for all $1 \le i \le n$, $1 \le j \le m$:
- $V(i,j) = \max\{\, V(i-1,j-1) + \sigma(s_i,t_j),\ V(i-1,j) + \sigma(s_i,-),\ V(i,j-1) + \sigma(-,t_j) \,\}$
- The optimal solution is found in the last row or column
- (not necessarily at the bottom-right corner)
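As a sketch of the ends-free variant: zero base conditions make leading gaps free, and taking the maximum over the last row and last column makes trailing gaps free. The scoring function `simple_sigma` is an assumption (match +1, mismatch -1, space -2).

```python
def simple_sigma(a, b):
    # Hypothetical scores: match +1, mismatch -1, space -2.
    if a == "-" or b == "-":
        return -2
    return 1 if a == b else -1

def ends_free_score(s, t, sigma=simple_sigma):
    """Global-style recurrence with free end gaps in both strings."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]   # leading gaps cost nothing
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(                       # note: no 0 option here
                V[i - 1][j - 1] + sigma(s[i - 1], t[j - 1]),
                V[i - 1][j] + sigma(s[i - 1], "-"),
                V[i][j - 1] + sigma("-", t[j - 1]),
            )
    # Trailing gaps cost nothing: best cell in the last row or last column.
    last_row = max(V[n])
    last_col = max(V[i][m] for i in range(n + 1))
    return max(last_row, last_col)
```

A short "gene" inside a longer "genome", such as `ends_free_score("ACGT", "TTTACGTTTT")`, scores as if the flanking genome were not there.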
10Handling weird gaps
- Affine gap: different costs for opening a new gap and for extending an existing one
[Figure: the alignment cases from the reminder slide, with the columns before the last one marked "Now we care if there were gaps here"]
- Two new things to keep track of ⇒ two additional matrices
11Alignment with Affine Gap Penalty
[Figure: G(i,j): S.....i over T.....j; E(i,j): S.....i------ over T..............j; F(i,j): S...............i over T.....j-------]
- Base Conditions:
- $V(i,0) = F(i,0) = W_g + i W_s$
- $V(0,j) = E(0,j) = W_g + j W_s$
- Recursive Computation:
- $V(i,j) = \max\{ E(i,j),\ F(i,j),\ G(i,j) \}$
- where
- $G(i,j) = V(i-1,j-1) + \sigma(s_i,t_j)$
- $E(i,j) = \max\{ E(i,j-1) + W_s,\ G(i,j-1) + W_g + W_s,\ F(i,j-1) + W_g + W_s \}$
- $F(i,j) = \max\{ F(i-1,j) + W_s,\ G(i-1,j) + W_g + W_s,\ E(i-1,j) + W_g + W_s \}$
- Time complexity O(nm): compute 4 matrices instead of one.
- Space complexity O(nm): saving 3 (Why?) matrices. O(n+m) with Hirschberg.
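The four-matrix recurrence can be sketched as follows. Assumptions to note: W_g and W_s are passed as negative numbers (`wg`, `ws`) that are added to the score, a gap of k spaces therefore costs W_g + k W_s, `sigma` only scores character pairs, and undefined border cells are set to minus infinity. The example scores are illustrative only.

```python
import math

def affine_gap_score(s, t, sigma, wg, ws):
    """Global alignment score with gap weights wg (open) and ws (per space)."""
    n, m = len(s), len(t)
    NEG = -math.inf
    V = [[NEG] * (m + 1) for _ in range(n + 1)]
    E = [[NEG] * (m + 1) for _ in range(n + 1)]   # ends with t_j against a space
    F = [[NEG] * (m + 1) for _ in range(n + 1)]   # ends with s_i against a space
    G = [[NEG] * (m + 1) for _ in range(n + 1)]   # ends with s_i against t_j
    V[0][0] = 0
    for i in range(1, n + 1):
        V[i][0] = F[i][0] = wg + i * ws
    for j in range(1, m + 1):
        V[0][j] = E[0][j] = wg + j * ws
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            G[i][j] = V[i - 1][j - 1] + sigma(s[i - 1], t[j - 1])
            E[i][j] = max(E[i][j - 1] + ws,            # extend the open gap
                          G[i][j - 1] + wg + ws,       # open a new gap
                          F[i][j - 1] + wg + ws)
            F[i][j] = max(F[i - 1][j] + ws,
                          G[i - 1][j] + wg + ws,
                          E[i - 1][j] + wg + ws)
            V[i][j] = max(E[i][j], F[i][j], G[i][j])
    return V[n][m]

def match_mismatch(a, b):
    # Hypothetical character scores: match +1, mismatch -1.
    return 1 if a == b else -1

score = affine_gap_score("AAAGGG", "AAATTTGGG", match_mismatch, wg=-3, ws=-0.5)
```

Here the best alignment places one gap of three spaces against TTT, giving six matches plus -3 - 1.5.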
12When do constant and affine gap costs differ?
- Sequences: AGAGACTGACGCTTA and ATATTA
- Alignment A:
AGAGACTGACGCTTA
ATA---------TTA
- Alignment B:
AGAGACTGACGCTTA
----A-T-A---TTA
- Constant penalty (mismatch -5, gap -1): A scores -14, B scores -9
- Affine penalty (mismatch -5, gap open -3, gap extend -0.5): A scores -12, B scores -14.5
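The four scores can be checked by scoring the two alignments of AGAGACTGACGCTTA with ATATTA column by column. The convention below is an assumption, chosen because it reproduces the slide's numbers: matches score 0, the first space of a gap costs the gap-open penalty, and each further space in the same gap costs the gap-extend penalty. The function name `score_alignment` is hypothetical.

```python
def score_alignment(top, bottom, mismatch, gap_open, gap_extend):
    """Score a given alignment, one column at a time.

    A constant per-space penalty is the special case gap_open == gap_extend.
    """
    score = 0
    prev_gap_row = None            # row that held a space in the previous column
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            row = "top" if a == "-" else "bottom"
            # Same row as the previous space: the gap continues.
            score += gap_extend if row == prev_gap_row else gap_open
            prev_gap_row = row
        else:
            if a != b:
                score += mismatch
            prev_gap_row = None
    return score

s = "AGAGACTGACGCTTA"
one_gap = "ATA---------TTA"       # one mismatch, a single gap of 9 spaces
four_gaps = "----A-T-A---TTA"     # no mismatch, 9 spaces spread over 4 gaps

constant = [score_alignment(s, x, -5, -1, -1) for x in (one_gap, four_gaps)]
affine = [score_alignment(s, x, -5, -3, -0.5) for x in (one_gap, four_gaps)]
```

Under the constant penalty the four-gap alignment wins (-9 against -14); under the affine penalty the single long gap wins (-12 against -14.5).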
13Bounding the number of gaps
- Let's say we are allowed to have at most K gaps
- (Gaps ≠ spaces: a gap can contain many spaces)
- Now we keep track of the number of gaps we opened so far
- Also still need to keep track of whether a gap is currently open in S or T (E/F matrices)
14Bounding the number of gaps
- A multi-layer DP matrix
- Actually separate functions V, E, F on every layer, keeping track of the layer number
- Every time we open or close a gap we jump to the next layer
- Where to look for the solution? (not only at the last layer!)
- What is the complexity?
15Bounding the number of spaces
- Let's say that no gap can exceed k spaces
- Of course, now we cannot also bound the number of gaps (why?)
- How many matrices do we need now?
- Here, there is no monotone notion of a layer like before
- What's the complexity?
16What about arbitrary gap functions?
- Suppose the gap cost is an arbitrary function f(k) of its length k
- Then, when computing D(i,j), we need to look up to j places back and up to i places up
- Complexity?
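One way to sketch the arbitrary-gap DP is shown below. It is an assumption-laden illustration, not the slides' own formulation: three tables G, E, F are kept so that a gap is never followed directly by another gap in the same string (which would really be one longer gap), `f(k)` returns the (negative) score of a gap of k spaces, and `sigma` scores character pairs only. Each cell scans up to j cells back and i cells up, so the running time is O(nm(n+m)).

```python
import math

def arbitrary_gap_score(s, t, sigma, f):
    """Global alignment score where a gap of k spaces scores f(k)."""
    n, m = len(s), len(t)
    NEG = -math.inf
    G = [[NEG] * (m + 1) for _ in range(n + 1)]   # ends with s_i against t_j
    E = [[NEG] * (m + 1) for _ in range(n + 1)]   # ends with a gap in S
    F = [[NEG] * (m + 1) for _ in range(n + 1)]   # ends with a gap in T
    G[0][0] = 0                                   # empty alignment
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                prev = max(G[i - 1][j - 1], E[i - 1][j - 1], F[i - 1][j - 1])
                G[i][j] = prev + sigma(s[i - 1], t[j - 1])
            # Gap of length k in S: look k places back in the same row.
            for k in range(1, j + 1):
                E[i][j] = max(E[i][j], max(G[i][j - k], F[i][j - k]) + f(k))
            # Gap of length k in T: look k places up in the same column.
            for k in range(1, i + 1):
                F[i][j] = max(F[i][j], max(G[i - k][j], E[i - k][j]) + f(k))
    return max(G[n][m], E[n][m], F[n][m])

def match_mismatch(a, b):
    # Hypothetical character scores: match +1, mismatch -1.
    return 1 if a == b else -1

# Illustrative gap function: every gap costs 3, whatever its length.
flat = arbitrary_gap_score("AAAGGG", "AAATTTGGG", match_mismatch, lambda k: -3)
```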
17Special cases
- How about a logarithmic penalty? $W_g + W_s \log(k)$
- This is a special case of a convex penalty, which is solvable in $O(mn \log m)$
- The logarithmic case can be done in $O(mn)$
- For a piecewise-linear gap function made of K lines, DP can be done in $O(mn \log K)$
18Supersequence
- Exercise: A is called a non-contiguous supersequence of B if B is a non-contiguous subsequence of A.
- e.g., YABADABADU is a non-contiguous supersequence of BABU (yaBAdaBadU)
- Given S and T, find their shortest common supersequence
19Reminder: LCS
- Longest common non-contiguous subsequence
- Adjust global alignment with similarity scores:
- 1 for match
- 0 for gaps
- $-\infty$ for mismatches
20Supersequence
- Find the longest common sub-sequence of S,T
- Generate the string as follows:
- for every column in the alignment:
- Match: add the matching character (once!)
- Gap: add the character aligned against the gap
21Supersequence
- For S = Pride, T = Parade:
- P-R-IDE
- PARA-DE
- PARAIDE: shortest common supersequence
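The construction can be sketched directly on the LCS table, without building the alignment explicitly: the traceback emits a matched character once and every other character as it is passed. The function name is an assumption, and ties in the traceback are broken arbitrarily, so the result is one shortest common supersequence, not necessarily the one on the slide.

```python
def shortest_common_supersequence(s, t):
    n, m = len(s), len(t)
    # L[i][j] = length of the longest common subsequence of s[:i] and t[:j]
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s[i - 1] == t[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    # Traceback from the bottom-right corner, collecting characters in reverse.
    out = []
    i, j = n, m
    while i > 0 and j > 0:
        if s[i - 1] == t[j - 1]:          # match column: add the character once
            out.append(s[i - 1])
            i -= 1
            j -= 1
        elif L[i - 1][j] >= L[i][j - 1]:  # s[i-1] is aligned against a gap
            out.append(s[i - 1])
            i -= 1
        else:                             # t[j-1] is aligned against a gap
            out.append(t[j - 1])
            j -= 1
    out.extend(reversed(s[:i]))           # whatever is left of either string
    out.extend(reversed(t[:j]))
    return "".join(reversed(out))

scs = shortest_common_supersequence("pride", "parade")
```

Its length is always |S| + |T| - |LCS(S,T)|, which is 5 + 6 - 4 = 7 for this pair.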
22Exercise: Finding repeats
- Basic objective: find a pair of subsequences within S with maximum similarity
- Simple (albeit wrong) idea: find an optimal alignment of S with itself! (Why wrong?)
- But using local alignment is still a good idea
23Variant 1
- Specific requirement: the two sequences may overlap
- Solution: change the local alignment algorithm
- Compute only the upper triangular submatrix (V(i,j), where j > i).
- Set diagonal values to 0
- Complexity: $O(n^2)$ time and $O(n)$ space
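A minimal sketch of this variant, keeping only two rows to get O(n) space. It returns the best score only (no traceback), and the scoring function `simple_sigma` (match +1, mismatch -1, space -2) is an assumption.

```python
def simple_sigma(a, b):
    # Hypothetical scores: match +1, mismatch -1, space -2.
    if a == "-" or b == "-":
        return -2
    return 1 if a == b else -1

def best_repeat_score(s, sigma=simple_sigma):
    """Local alignment of s with itself, restricted to cells with j > i."""
    n = len(s)
    prev = [0] * (n + 1)               # row i-1 of the table
    best = 0
    for i in range(1, n + 1):
        cur = [0] * (n + 1)            # cells with j <= i stay 0 (diagonal and below)
        for j in range(i + 1, n + 1):
            cur[j] = max(
                0,
                prev[j - 1] + sigma(s[i - 1], s[j - 1]),
                prev[j] + sigma(s[i - 1], "-"),
                cur[j - 1] + sigma("-", s[j - 1]),
            )
            best = max(best, cur[j])
        prev = cur
    return best
```

For `"AAAA"` the best pair is the overlapping AAA at positions 1..3 and 2..4; forcing the diagonal to 0 is what rules out the trivial alignment of S with itself.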
24Variant 2
- Specific requirement: the two sequences may not overlap
- Solution: absence of overlap means that k exists such that one string is in S[1..k] and another in S[k+1..n]
- Check local alignments between S[1..k] and S[k+1..n] for any 1 < k < n
- Pick the highest-scoring alignment
- Complexity: $O(n^3)$ time and $O(n)$ space
25Variant 2
26Variant 3
- Specific requirement: the two sequences must be consecutive (tandem repeat)
- Solution: similar to variant 2, but somewhat ends-free: seek a global alignment between S[1..k] and S[k+1..n],
- No penalties for gaps at the beginning of S[1..k]
- No penalties for gaps at the end of S[k+1..n]
- Complexity: $O(n^3)$ time and $O(n)$ space
27Variant 3
28Variant 4
- Specific requirement: the two sequences must be consecutive, and the similarity is measured between the first sequence and the reverse complement of the second, $S^{RC}$ (inverted repeat)
- Tempting (albeit wrong) to use something in the spirit of variant 3; this will give complexity $O(n^3)$
29Variant 4
- Solution: compute the local alignment between S and $S^{RC}$
- Look for results on the diagonal $i + j = n$
- AGCTAACGCGTTCGAA (n = 16)
[Figure: arrows labeled "Index 8" pointing into the example sequence]
- Complexity: $O(n^2)$ time, $O(n)$ space