Comp' Genomics - PowerPoint PPT Presentation

Slides: 30

Provided by: liadc


1
Comp. Genomics

Recitation 2
12/3/09
Slides by Igor Ulitsky

2
Outline

Alignment re-cap
End-space free alignment
Affine gap alignment algorithm and proof
Bounded gap/spaces alignments

3
Dynamic programming

Useful in many string-related settings
Will be repeatedly used in the course
General idea:
Confine the exponential number of possibilities
into some hierarchy, such that the number of
cases becomes polynomial

4
Dynamic programming for shortest paths

Finding the shortest path from X to Y using the
Floyd-Warshall algorithm
Idea: if we know the shortest paths using
intermediate vertices 1, ..., k-1, computing
shortest paths using 1, ..., k is easy
$$d_{ij}^{(k)} = \begin{cases} w_{ij} & \text{if } k = 0 \\ \min\{d_{ij}^{(k-1)},\ d_{ik}^{(k-1)} + d_{kj}^{(k-1)}\} & \text{otherwise} \end{cases}$$
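The recurrence above can be sketched in a few lines of Python. This is only an illustration, not code from the recitation: the function name `floyd_warshall` and the matrix representation (a list of lists with `math.inf` for missing edges and 0 on the diagonal) are assumptions made here.

```python
import math

def floyd_warshall(w):
    """Return all-pairs shortest path lengths.

    w[i][j] is the weight of edge i -> j, math.inf if there is no edge.
    Assumption: w[i][i] == 0 and the graph has no negative cycles.
    """
    n = len(w)
    d = [row[:] for row in w]  # d^(0) = w, copied so the input is untouched
    for k in range(n):         # now also allow vertex k as an intermediate
        for i in range(n):
            for j in range(n):
                via_k = d[i][k] + d[k][j]
                if via_k < d[i][j]:
                    d[i][j] = via_k
    return d
```

Updating a single matrix in place is safe because row k and column k do not change during iteration k, so the three nested loops take O(n^3) time and O(n^2) space.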

5
Alignment reminder
[Figure: the last column of an alignment either pairs the final characters of both strings (e.g. G with C) or pairs one final character with a space]
6
Global alignment

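Input: S1, S2
Output: optimal (maximum-score) alignment
V(k,l): score of aligning S1[1..k] with S2[1..l]
(s_i and t_j denote the i-th character of S1 and the j-th character of S2)
Base conditions:
$$V(i,0) = \sum_{k=1}^{i} \sigma(s_k, -)$$
$$V(0,j) = \sum_{k=1}^{j} \sigma(-, t_k)$$
Recurrence relation, for all $1 \le i \le n$, $1 \le j \le m$:
$$V(i,j) = \max \begin{cases} V(i-1,j-1) + \sigma(s_i,t_j) \\ V(i-1,j) + \sigma(s_i,-) \\ V(i,j-1) + \sigma(-,t_j) \end{cases}$$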
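A minimal Python sketch of this global alignment recurrence follows. The names `global_alignment_score` and `simple_sigma`, and the particular scores (+1 match, -1 mismatch, -1 per space), are assumptions chosen for illustration; the slides leave the scoring function open.

```python
def simple_sigma(a, b):
    """Hypothetical scoring function: +1 match, -1 mismatch, -1 per space."""
    if a == "-" or b == "-":
        return -1
    return 1 if a == b else -1

def global_alignment_score(s, t, sigma=simple_sigma):
    """Score of an optimal global alignment of s and t, i.e. V(n, m)."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):              # base: s[1..i] against spaces only
        V[i][0] = V[i - 1][0] + sigma(s[i - 1], "-")
    for j in range(1, m + 1):              # base: t[1..j] against spaces only
        V[0][j] = V[0][j - 1] + sigma("-", t[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(
                V[i - 1][j - 1] + sigma(s[i - 1], t[j - 1]),  # s_i with t_j
                V[i - 1][j] + sigma(s[i - 1], "-"),           # s_i with a space
                V[i][j - 1] + sigma("-", t[j - 1]),           # space with t_j
            )
    return V[n][m]
```

For example, `global_alignment_score("ACGT", "AGT")` pairs A, G and T and pays for one space opposite C.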

7
Alignment reminder

Global alignment
All of S1 has to be aligned with all of S2
Every gap is paid for
Solution equals V(n,m)

[Figure: DP matrix, with labels "Traceback all the way" and "Alignment score here"]
8
Local alignment

Local alignment
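A substring of S1 is aligned with a substring of S2
Gaps outside the substrings are costless
Solution equals the maximum-score cell in the DP
matrix
Base conditions:
$$V(i,0) = 0, \quad V(0,j) = 0$$
Recurrence relation, for all $1 \le i \le n$, $1 \le j \le m$:
$$V(i,j) = \max \begin{cases} V(i-1,j-1) + \sigma(s_i,t_j) \\ V(i-1,j) + \sigma(s_i,-) \\ V(i,j-1) + \sigma(-,t_j) \\ 0 \end{cases}$$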
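The local variant differs from the global one only in the zero base conditions, the extra 0 inside the max, and where the answer is read. A hedged sketch, with the function names and the scores (+2 match, -1 otherwise) picked here purely for illustration:

```python
def sigma(a, b):
    """Hypothetical scoring function: +2 match, -1 mismatch, -1 per space."""
    if a == "-" or b == "-":
        return -1
    return 2 if a == b else -1

def local_alignment_score(s, t):
    """Best score over all pairs of substrings of s and t."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]  # V(i,0) = V(0,j) = 0
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(
                0,                                            # start afresh here
                V[i - 1][j - 1] + sigma(s[i - 1], t[j - 1]),
                V[i - 1][j] + sigma(s[i - 1], "-"),
                V[i][j - 1] + sigma("-", t[j - 1]),
            )
            best = max(best, V[i][j])          # answer is the maximum cell
    return best
```

With these scores, `local_alignment_score("TTTACGTTT", "GGACGGG")` finds the shared substring ACG.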

9
Ends-free alignment

Something between global and local
Consider aligning a gene to a (bacterial) genome
Gaps at the beginning and end of S and T are
costless
But all of S and T should be aligned
Base conditions:
$$V(i,0) = 0, \quad V(0,j) = 0$$
Recurrence relation, for all $1 \le i \le n$, $1 \le j \le m$:
$$V(i,j) = \max \begin{cases} V(i-1,j-1) + \sigma(s_i,t_j) \\ V(i-1,j) + \sigma(s_i,-) \\ V(i,j-1) + \sigma(-,t_j) \end{cases}$$
The optimal solution is found in the last
row or column
(not necessarily at the bottom-right corner)
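As a rough sketch of the ends-free scheme: zero base conditions make leading gaps free, and scanning the last row and last column makes trailing gaps free. The function names and the +1/-1 scores are illustrative assumptions.

```python
def sigma(a, b):
    """Hypothetical scoring function: +1 match, -1 mismatch, -1 per space."""
    if a == "-" or b == "-":
        return -1
    return 1 if a == b else -1

def ends_free_score(s, t):
    """Best alignment score when gaps at both ends of s and t are free."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]  # free leading gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(
                V[i - 1][j - 1] + sigma(s[i - 1], t[j - 1]),
                V[i - 1][j] + sigma(s[i - 1], "-"),
                V[i][j - 1] + sigma("-", t[j - 1]),
            )
    last_row = max(V[n])                        # all of s consumed
    last_col = max(V[i][m] for i in range(n + 1))  # all of t consumed
    return max(last_row, last_col)
```

Aligning a short "gene" inside a longer "genome", `ends_free_score("ACG", "TTACGTT")`, is not charged for the flanking characters of the genome.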

10
Handling weird gaps

Affine gap: different costs for opening a new gap and
extending an existing one

[Figure: the possible last columns of an alignment, as in the alignment reminder, with the note "Now we care if there were gaps here"]
Two new things to keep track of → two additional
matrices
11
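Alignment with Affine Gap Penalty

[Figure: the three alignment types
G(i,j): S.....i over T.....j (s_i is aligned with t_j)
E(i,j): S.....i------ over T..............j (the alignment ends with spaces in S)
F(i,j): S...............i over T.....j------- (the alignment ends with spaces in T)]

Base conditions:
$$V(i,0) = F(i,0) = W_g + i W_s$$
$$V(0,j) = E(0,j) = W_g + j W_s$$
Recursive computation:
$$V(i,j) = \max\{E(i,j),\ F(i,j),\ G(i,j)\}$$
where
$$G(i,j) = V(i-1,j-1) + \sigma(s_i,t_j)$$
$$E(i,j) = \max\{E(i,j-1) + W_s,\ G(i,j-1) + W_g + W_s,\ F(i,j-1) + W_g + W_s\}$$
$$F(i,j) = \max\{F(i-1,j) + W_s,\ G(i-1,j) + W_g + W_s,\ E(i-1,j) + W_g + W_s\}$$
(Here $W_g$ is the gap-opening score and $W_s$ the per-space score, both read as negative numbers that are added to the total.)

Time complexity: O(nm), computing 4 matrices
instead of one.
Space complexity: O(nm), saving 3 matrices (why?);
O(n+m) with Hirschberg's algorithm.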
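The four-matrix computation might be sketched as below. This is an illustration under stated assumptions, not the recitation's code: `wg` and `ws` are taken as negative scores that are added, cells that cannot occur are set to minus infinity, `G[0][0] = 0` is a convenience so that the first gap can open from the empty alignment, and the scoring function is a hypothetical +1/-1 scheme.

```python
NEG = float("-inf")

def sigma(a, b):
    """Hypothetical scoring function: +1 match, -1 mismatch."""
    return 1 if a == b else -1

def affine_alignment_score(s, t, wg, ws):
    """Global alignment score with affine gaps; wg, ws are negative scores."""
    n, m = len(s), len(t)
    new = lambda: [[NEG] * (m + 1) for _ in range(n + 1)]
    V, E, F, G = new(), new(), new(), new()
    V[0][0] = G[0][0] = 0                      # empty alignment
    for i in range(1, n + 1):
        V[i][0] = F[i][0] = wg + i * ws        # s[1..i] against one gap
    for j in range(1, m + 1):
        V[0][j] = E[0][j] = wg + j * ws        # t[1..j] against one gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            G[i][j] = V[i - 1][j - 1] + sigma(s[i - 1], t[j - 1])
            E[i][j] = max(E[i][j - 1] + ws,          # extend a gap in s
                          G[i][j - 1] + wg + ws,     # open a gap in s
                          F[i][j - 1] + wg + ws)
            F[i][j] = max(F[i - 1][j] + ws,          # extend a gap in t
                          G[i - 1][j] + wg + ws,     # open a gap in t
                          E[i - 1][j] + wg + ws)
            V[i][j] = max(E[i][j], F[i][j], G[i][j])
    return V[n][m]
```

For instance, with `wg=-2` and `ws=-1`, aligning "ACGT" with "AT" keeps A and T matched and pays once for a gap of two spaces.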

12
When do constant and affine gap costs differ?
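Consider aligning AGAGACTGACGCTTA with ATATTA:

Alignment 1:
AGAGACTGACGCTTA
ATA---------TTA
Alignment 2:
AGAGACTGACGCTTA
----A-T-A---TTA

Constant penalty (mismatch -5, gap -1 per space):
alignment 1 scores -14, alignment 2 scores -9
Affine penalty (mismatch -5, gap open -3, gap
extend -0.5):
alignment 1 scores -12, alignment 2 scores -14.5
(These totals charge the gap-open penalty for the first space of a gap and the gap-extend penalty for each further space.)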
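The four totals on this slide can be rechecked with a small scorer for a given alignment. The function `score_alignment` is a hypothetical helper written for this check; it assumes matches score 0 and that the gap-open value is charged for the first space of each gap, which is the reading under which the slide's numbers come out.

```python
def score_alignment(row1, row2, mismatch, gap_open, gap_extend):
    """Score two equal-length alignment rows ('-' marks a space).

    Assumption: matches score 0; the first space of a gap costs gap_open,
    every further space of the same gap costs gap_extend.
    """
    score = 0.0
    prev = None                    # row holding the space in the previous column
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            cur = 1 if a == "-" else 2
            score += gap_extend if cur == prev else gap_open
            prev = cur
        else:
            if a != b:
                score += mismatch
            prev = None
    return score

s = "AGAGACTGACGCTTA"
one_gap = "ATA---------TTA"
many_gaps = "----A-T-A---TTA"
print(score_alignment(s, one_gap, -5, -1, -1))      # constant: -14.0
print(score_alignment(s, many_gaps, -5, -1, -1))    # constant: -9.0
print(score_alignment(s, one_gap, -5, -3, -0.5))    # affine:   -12.0
print(score_alignment(s, many_gaps, -5, -3, -0.5))  # affine:   -14.5
```

A constant penalty is just the special case where `gap_open` equals `gap_extend`, which is why the preferred alignment flips between the two schemes.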
13
Bounding the number of gaps

Let's say we are allowed to have at most K gaps
(Gaps are not spaces: a gap can contain many spaces)
Now we keep track of the number of gaps we opened
so far
We also still need to keep track of whether a gap is
currently open in S or T (E/F matrices)

14
Bounding the number of gaps

A multi-layer DP matrix
Actually, separate functions V, E, F on every
layer, keeping track of the layer number
Every time we open or close a gap we jump to
the next layer
Where to look for the solution? (not only
at the last layer!)
What is the complexity?

15
Bounding the number of spaces

Let's say that no gap can exceed k spaces
Of course, now we cannot also bound the number
of gaps (why?)
How many matrices do we need now?
Here, there is no monotone notion of a layer like before
What's the complexity?

16
What about arbitrary gap functions?

If the gap cost is an arbitrary function of its
length, f(k)
Thus, when computing D(i,j), we need to look at j
places back and i places up
Complexity?

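One way to picture the "j places back and i places up" scan is the sketch below. It is an illustration under assumptions made here: scores are maximised (so `f(k)` returns a negative score for a gap of k spaces), the scoring function is a hypothetical +1/-1 scheme, and a single matrix is used, which is exact only when splitting a run of spaces into two adjacent gaps never scores better than one gap.

```python
NEG = float("-inf")

def sigma(a, b):
    """Hypothetical scoring function: +1 match, -1 mismatch."""
    return 1 if a == b else -1

def general_gap_score(s, t, f):
    """Global alignment score where a gap of k spaces scores f(k).

    Assumption: f(a) + f(b) <= f(a + b), so merging adjacent gaps never hurts.
    """
    n, m = len(s), len(t)
    D = [[NEG] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            best = NEG
            if i > 0 and j > 0:                 # s_i aligned with t_j
                best = D[i - 1][j - 1] + sigma(s[i - 1], t[j - 1])
            for k in range(1, j + 1):           # gap of k spaces ending at t_j
                best = max(best, D[i][j - k] + f(k))
            for k in range(1, i + 1):           # gap of k spaces ending at s_i
                best = max(best, D[i - k][j] + f(k))
            D[i][j] = best
    return D[n][m]
```

Each of the nm cells scans up to i + j earlier cells, so this sketch runs in O(nm(n+m)) time. Passing `lambda k: -(2 + k)` reproduces an affine penalty, and any other length function can be swapped in.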
17
Special cases

How about a logarithmic penalty? $W_g + W_s \log(k)$
This is a special case of a convex penalty, which
is solvable in $O(mn \log m)$
The logarithmic case can be done in $O(mn)$
For a piecewise-linear gap function made of K
lines, DP can be done in $O(mn \log K)$

18
Supersequence

Exercise: A is called a non-contiguous
supersequence of B if B is a non-contiguous
subsequence of A.
e.g., YABADABADU is a non-contiguous supersequence
of BABU (YABADABADU)
Given S and T, find their shortest common
supersequence

19
Reminder LCS

Longest common non-contiguous subsequence
Adjust global alignment with similarity scores:
1 for match
0 for gaps
-8 for mismatches

20
Supersequence

Find the longest common subsequence of S, T
Generate the string as follows:
for every column in the alignment:
Match: add the matching character (once!)
Gap: add the character aligned against the gap

21
Supersequence

For S = Pride, T = Parade:
P-R-IDE
PARA-DE
PARAIDE (shortest common supersequence)
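The construction can be sketched by filling the LCS table and walking it back, emitting one character per alignment column. The function name and the tie-breaking rule in the traceback are choices made here for illustration; other tie-breaks give other supersequences of the same length.

```python
def shortest_common_supersequence(s, t):
    """Build a shortest common supersequence of s and t from their LCS table."""
    n, m = len(s), len(t)
    # L[i][j] = length of the longest common subsequence of s[:i] and t[:j]
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s[i - 1] == t[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    out = []                       # collected back to front
    i, j = n, m
    while i > 0 and j > 0:
        if s[i - 1] == t[j - 1]:   # match column: add the character once
            out.append(s[i - 1]); i -= 1; j -= 1
        elif L[i - 1][j] >= L[i][j - 1]:
            out.append(s[i - 1]); i -= 1   # s_i against a space
        else:
            out.append(t[j - 1]); j -= 1   # t_j against a space
    out.extend(reversed(s[:i]))    # leftover prefix of s, if any
    out.extend(reversed(t[:j]))    # leftover prefix of t, if any
    return "".join(reversed(out))

print(shortest_common_supersequence("PRIDE", "PARADE"))  # PARAIDE
```

The result has length |S| + |T| minus the LCS length, here 5 + 6 - 4 = 7.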

22
Exercise Finding repeats

Basic objective: find a pair of subsequences
within S with maximum similarity
Simple (albeit wrong) idea: find an optimal
alignment of S with itself! (Why wrong?)
But using local alignment is still a good idea

23
Variant 1

Specific requirement: the two sequences may
overlap
Solution: change the local alignment algorithm
Compute only the upper triangular submatrix
(V(i,j), where j > i).
Set diagonal values to 0
Complexity: $O(n^2)$ time and $O(n)$ space
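A sketch of this variant, keeping only two rows of the matrix so that space stays linear. The function names and the +1/-1 scores are assumptions for illustration.

```python
def sigma(a, b):
    """Hypothetical scoring function: +1 match, -1 mismatch, -1 per space."""
    if a == "-" or b == "-":
        return -1
    return 1 if a == b else -1

def best_repeat_score(s):
    """Best local alignment of s with itself, restricted to cells with j > i."""
    n = len(s)
    prev = [0] * (n + 1)           # row i-1 of the matrix
    best = 0
    for i in range(1, n + 1):
        cur = [0] * (n + 1)        # cells with j <= i (diagonal and below) stay 0
        for j in range(i + 1, n + 1):
            cur[j] = max(
                0,
                prev[j - 1] + sigma(s[i - 1], s[j - 1]),
                prev[j] + sigma(s[i - 1], "-"),
                cur[j - 1] + sigma("-", s[j - 1]),
            )
            best = max(best, cur[j])
        prev = cur
    return best
```

Zeroing the diagonal rules out the trivial alignment of S with itself, while still allowing the two copies to overlap, as in `best_repeat_score("AAAA")`, where AAA at positions 1..3 pairs with AAA at positions 2..4.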

24
Variant 2

Specific requirement: the two sequences may not
overlap
Solution: absence of overlap means that k exists
such that one string is in S[1..k] and the other in
S[k+1..n]
Check local alignments between S[1..k] and
S[k+1..n] for any 1 < k < n
Pick the highest-scoring alignment
Complexity: $O(n^3)$ time and $O(n)$ space
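The split-and-align idea could look like the sketch below: one linear-space local alignment per split point. Function names and the +1/-1 scores are illustrative assumptions, and the loop simply tries every split that leaves both sides non-empty.

```python
def sigma(a, b):
    """Hypothetical scoring function: +1 match, -1 mismatch, -1 per space."""
    if a == "-" or b == "-":
        return -1
    return 1 if a == b else -1

def local_score(a, b):
    """Local alignment score of a and b using two rows of the DP matrix."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            cur[j] = max(
                0,
                prev[j - 1] + sigma(a[i - 1], b[j - 1]),
                prev[j] + sigma(a[i - 1], "-"),
                cur[j - 1] + sigma("-", b[j - 1]),
            )
            best = max(best, cur[j])
        prev = cur
    return best

def best_nonoverlapping_repeat(s):
    """Try every split point k and align S[1..k] locally against S[k+1..n]."""
    return max((local_score(s[:k], s[k:]) for k in range(1, len(s))), default=0)
```

There are O(n) splits and each local alignment costs O(n^2) time, which gives the O(n^3) bound; on "AAAA" the best non-overlapping pair is AA with AA.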

25
Variant 2
26
Variant 3

Specific requirement: the two sequences must be
consecutive (tandem repeat)
Solution: similar to variant 2, but somewhat
ends-free: seek a global alignment between
S[1..k] and S[k+1..n],
No penalties for gaps at the beginning of S[1..k]
No penalties for gaps at the end of S[k+1..n]
Complexity: $O(n^3)$ time and $O(n)$ space

27
Variant 3
28
Variant 4

Specific requirement: the two sequences must be
consecutive, and the similarity is measured
between the first sequence and the reverse
complement of the second, S^RC (inverted repeat)
It is tempting (albeit wrong) to use something in the
spirit of variant 3, which would give complexity $O(n^3)$

29
Variant 4