Comp. Genomics - PowerPoint PPT Presentation

About This Presentation
Title:

Comp. Genomics

Description:

Comp. Genomics Recitation 2 12/3/09 Slides by Igor Ulitsky Outline Alignment re-cap End-space free alignment Affine gap alignment algorithm and proof Bounded gap ... – PowerPoint PPT presentation

Number of Views:27
Avg rating:3.0/5.0
Slides: 30
Provided by: LiadC4
Category:

less

Transcript and Presenter's Notes

Title: Comp. Genomics


1
Comp. Genomics
  • Recitation 2
  • 12/3/09
  • Slides by Igor Ulitsky

2
Outline
  • Alignment re-cap
  • End-space free alignment
  • Affine gap alignment algorithm and proof
  • Bounded gap/spaces alignments

3
Dynamic programming
  • Useful in many string-related settings
  • Will be repeatedly used in the course
  • General idea
  • Confine the exponential number of possibilities
    into some hierarchy, such that the number of
    cases becomes polynomial

4
Dynamic programming for shortest paths
  • Finding the shortest path from X to Y using the
    Floyd Warshall
  • Idea if we know what is the shortest path using
    intermediate vertices 1,, k-1, computing
    shortest paths using 1,, k is easy
  • wij if k0
  • dij(k) mindij(k-1), dik(k-1)dkj(k-1) otherwise

5
Alignment reminder
Something1G
Something1G
Something1C
Something2C
Something1G
Something1C
Somethin g1G
Something2C-
Something1G
Something1G-
Something1C
Somethin g2C
6
Global alignment
  • Input S1,S2
  • Output Minimum cost alignment
  • V(k,l) score of aligning S11..k with S21..l
  • Base conditions
  • V(i,0) ?k0..i?(sk,-)
  • V(0,j) ?k0..j?(-,tk)
  • Recurrence relation V(i-1,j-1) ?(si,tj)
  • ?1?i?n, 1?j?m V(i,j) max V(i-1,j) ?(si,-)
  • V(i,j-1) ?(-,tj)

7
Alignment reminder
  • Global alignment
  • All of S1 has to be aligned with all of S2
  • Every gap is payed for
  • Solution equals V(n,m)

Traceback all the way
Alignment score here
8
Local alignment
  • Local alignment
  • Subset of S1 aligned with a subset of S2
  • Gaps outside subsets costless
  • Solution equals the maximum score cell in the DP
    matrix
  • Base conditions
  • V(i,0) 0
  • V(0,j) 0
  • Recurrence relation V(i-1,j-1) ?(si,tj)
  • ?1?i?n, 1?j?m V(i,j) max V(i-1,j) ?(si,-)
  • V(i,j-1) ?(-,tj)
  • 0

9
Ends-free alignment
  • Something between global and local
  • Consider aligning a gene to a (bacterial) genome
  • Gaps in the beginning and end of S and T are
    costless
  • But all of S,T should be aligned
  • Base conditions
  • V(i,0) 0
  • V(0,j) 0
  • Recurrence relation V(i-1,j-1) ?(si,tj)
  • ?1?i?n, 1?j?m V(i,j) max V(i-1,j) ?(si,-)
  • V(i,j-1) ?(-,tj)
  • The optimal solution is found at the last
    row/column
  • (not necessarily at bottom right corner)

10
Handling weird gaps
  • Affine gap different cost for a new and old
    gaps

Something1G
Something1G
Something1C
Something2C
Something1G
Something1C
Somethin g1G
Something2C-
Now we care if there were gaps here
Two new things to keep track ? Two additional
matrices
Something1G
Something1G-
Something1C
Somethin g2C
11
S.....i T.....j
G(i,j)
Alignment with Affine Gap Penalty
  • Base Conditions
  • V(i, 0) F(i, 0) Wg iWs
  • V(0, j) E(0, j) Wg jWs
  • Recursive Computation
  • V(i, j) max E(i, j), F(i, j), G(i, j)
  • where
  • G(i, j) V(i-1, j-1) ?(si, tj)
  • E(i, j) max E(i, j-1) Ws , G(i, j-1) Wg
    Ws , F(i, j-1) Wg Ws
  • F(i, j) max F(i-1, j) Ws , G(i-1, j) Wg
    Ws , E(i-1, j) Wg Ws

S.....i------ T..............j
E(i,j)
S...............i T.....j-------
  • Time complexity O(nm) - compute 4 matrices
    instead of one.
  • Space complexity O(nm) - saving 3 (Why?)
    matrices. O(nm) w/ Hir.

12
When do constant and affine gap costs differ?
AGAGACTGACGCTTA ATATTA
  • Consider

AGAGACTGACGCTTA ATA---------TTA
AGAGACTGACGCTTA ----A-T-A---TTA
Constant penalty Mismatch -5 Gap -1
-9
-14
Affine penalty Mismatch -5 Gap open -3 Gap
extend -0.5
-12
-14.5
13
Bounding the number of gaps
  • Lets say we are allowed to have at most K gaps
  • (Gaps ? Spaces ? Gap can contain many spaces)
  • Now we keep track of the number of gaps we opened
    so far
  • Also still need to keep track of whether a gap is
    currently open in S or T (E/F matrices)

14
Bounding the number of gaps
  • A multi-layer DP matrix
  • Actually separate functions V,E,F, on every
    layer, keeping track of layer no.
  • Every time we open or close a gap we jump to
    the next layer
  • Where to look for the solution? (not only
  • at last layer!)
  • What is the complexity?

15
Bounding the number of spaces
  • Lets say that no gap can exceed k spaces
  • Of course now cannot also bound number
  • of gaps as well (why?)
  • How many matrices do we need now?
  • Here, no monotone notion of layer like before
  • Whats the complexity?

16
What about arbitrary gap functions?
  • If the gap cost is an arbitrary function of its
    length f(k)
  • Thus, when computing Dij, we need to look at j
    places back and i places up
  • Complexity?

Something1G
Something1C
min
17
Special cases
  • How about a logarithmic penalty? WgWslog(k)
  • This is a special case of a convex penalty, which
    is solvable in O(mnlog(m))
  • The logarithmic case can be done in O(mn)
  • For a piece-wise linear gap function made of K
    lines, DP can be done in O(mnlog(K))

18
Supersequence
  • Exercise A is called a non-contiguous
    supersequence of B if B is a non-contiguous
    subsequence of A.
  • e.g., YABADABADU is a non-contigous supersequence
    of BABU (YABADABADU)
  • Given S and T, find their shortest common
    supersequence

19
Reminder LCS
  • Longest common non-contigous subsequence
  • Adjust global alignment with similarity scores
  • 1 for match
  • 0 for gaps
  • -8 for mismatches

20
Supersequence
  • Find the longest common sub-sequence of S,T
  • Generate the string as follows
  • for every column in the alignment
  • Match add the matching character (once!)
  • Gap add the character aligned against the gap

21
Supersequence
  • For SPride TParade
  • P-R-IDE
  • PARA-DE
  • PARAIDE Shortest common supersequence

22
Exercise Finding repeats
  • Basic objective find a pair of subsequences
    within S with maximum similarity
  • Simple (albeit wrong) idea Find an optimal
    alignment of S with itself! (Why wrong?)
  • But using local alignment is still a good idea

23
Variant 1
  • Specific requirement the two sequences may
    overlap
  • Solution Change the local alignment algorithm
  • Compute only the upper triangular submatrix
    (V(i,j), where jgti).
  • Set diagonal values to 0
  • Complexity O(n2) time and O(n) space

24
Variant 2
  • Specific requirement the two sequences may not
    overlap
  • Solution Absence of overlap means that k exists
    such that one string is in S1..k and another in
    Sk1..n
  • Check local alignments between S1..k and
    Sk1..n for any 1ltkltn
  • Pick the highest-scoring alignment
  • Complexity O(n3) time and O(n) space

25
Variant 2
26
Variant 3
  • Specific requirement the two sequences must be
    consequtive (tandem repeat)
  • Solution Similar to variant 2, but somewhat
    ends-free seek a global alignment between
    S1..k and Sk1..n,
  • No penalties for gaps in the beginning of S1..k
  • No penalties for gaps in the end of Sk1..n
  • Complexity O(n3) time and O(n) space

27
Variant 3
28
Variant 4
  • Specific requirement the two sequences must be
    consequtive and the similarity is measured
    between the first sequence and the reverse
    complement of the second - SRC (inverted repeat)
  • Tempting (albeit wrong) to use something in the
    spirit of variant 3 will give complexity O(n3)

29
Variant 4
  • Solution Compute the local alignment between S
    and SRC
  • Look for results on the diagonal ijn
  • AGCTAACGCGTTCGAA (n16)
  • Complexity O(n2) time, O(n) space

?Index 8
Index 8 ?
Write a Comment
User Comments (0)
About PowerShow.com