Bioinformatics Algorithms and Data Structures

Provided by: john244
Learn more at: https://www.cse.sc.edu

1
Bioinformatics Algorithms and Data Structures
  • Chapter 11, Sections 4-7
  • Lecturer: Dr. Rose
  • Slides by Dr. Rose
  • February 4 and 6, 2003

2
Edit Graphs
  • Key idea: weighted edit graph
  • Defn. Given strings S1 and S2 of lengths n and m,
    respectively, a weighted edit graph has (n+1) by
    (m+1) nodes, labelled (i,j), 0 ≤ i ≤ n, 0 ≤ j ≤
    m. The edge weights are problem specific.

3
Edit Graphs
  • Example: the edit distance problem
  • The weighted graph for the edit distance problem
    has directed edges from node (i, j) to the nodes
    (i+1, j), (i, j+1), and (i+1, j+1),
    provided they exist.
  • The weight of the directed edges to nodes (i+1,
    j) and (i, j+1) is 1.
  • The weight of the directed edge to (i+1, j+1)
    is t(i+1, j+1), where t(i, j) is 0 if S1(i) =
    S2(j) and 1 otherwise.
  • Figure 11.4 in the textbook shows an edit graph.

4
Edit Graphs
  • Thm. An edit transcript for strings S1 and S2 has
    the minimum number of edit operations if and only
    if it corresponds to a shortest path from (0,0) to
    (n,m) in the edit graph.
  • Cor. The set of all shortest paths from (0,0) to
    (n,m) in the edit graph specifies all optimal edit
    transcripts of S1 to S2.
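
The theorem can be checked on small inputs with a minimal Python sketch that treats the edit graph as a DAG and relaxes its edges in row-major order. The function name is illustrative and not from the slides, and the diagonal weight assumes the unweighted scheme (0 for a match, 1 for a replacement).

```python
def edit_graph_shortest_path(s1: str, s2: str) -> int:
    """Weight of a shortest path from (0,0) to (n,m) in the edit graph."""
    n, m = len(s1), len(s2)
    inf = float("inf")
    # dist[i][j] = weight of a shortest path from (0,0) to node (i,j)
    dist = [[inf] * (m + 1) for _ in range(n + 1)]
    dist[0][0] = 0
    # Row-major order is a topological order of this DAG, so every node
    # is final before its outgoing edges are relaxed.
    for i in range(n + 1):
        for j in range(m + 1):
            d = dist[i][j]
            if i < n:  # edge to (i+1, j), weight 1
                dist[i + 1][j] = min(dist[i + 1][j], d + 1)
            if j < m:  # edge to (i, j+1), weight 1
                dist[i][j + 1] = min(dist[i][j + 1], d + 1)
            if i < n and j < m:  # diagonal edge, weight t(i+1, j+1)
                t = 0 if s1[i] == s2[j] else 1
                dist[i + 1][j + 1] = min(dist[i + 1][j + 1], d + t)
    return dist[n][m]


print(edit_graph_shortest_path("kitten", "sitting"))  # prints 3
```

The returned path weight equals the edit distance of the two strings, which is what the theorem predicts.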

5
Weight Edit Distance
  • There are two ways of assigning weights or costs
    to calculate edit distance:
  • By edit operation
  • By alphabet, i.e., different costs for different
    characters
  • Our initial approach was to assign weight by edit
    operation, i.e., 1 for insert, delete, replace,
    and 0 for match.
  • We can generalize our approach by assigning the
    weight d for an insertion or deletion, r for a
    replacement, and e for a match.

6
Weight Edit Distance
  • Q: What values for d, r, and e have we been
    using?
  • A: d = 1, r = 1, and e = 0.
  • Q: What would happen if r > 2d?
  • A: Replacements would never occur, because a
    deletion followed by an insertion (cost 2d) would
    be cheaper.
  • Defn. The operation-weight edit distance problem
    entails finding an edit transcript transforming
    S1 to S2 with the minimum total operation weight.

7
Weight Edit Distance
  • Q: What changes should we make to the definition
    of edit distance, D(i,j), to reflect operation
    weights?
  • We have to specify an operation-specific
    definition.
  • The base conditions become:
  • D(i,0) = i·d. Why?
  • D(0,j) = j·d. Why?

8
Weight Edit Distance
  • The general recurrence becomes:
  • D(i,j) = min[D(i,j-1) + d, D(i-1,j) + d,
    D(i-1,j-1) + t(i,j)]
  • Where t(i,j) = e if S1(i) = S2(j), o/w t(i,j) =
    r
  • Q: Why?
  • A: The cost of
  • Delete (from i-1,j) is d
  • Insert (from i,j-1) is d
  • Match (from i-1,j-1) is e
  • Replace (from i-1,j-1) is r
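
The base conditions and recurrence for operation-weight edit distance can be sketched in a few lines of Python. This is an illustration under the slides' definitions; the function name and the keyword defaults (d = 1, r = 1, e = 0) are choices made here.

```python
def operation_weight_distance(s1, s2, d=1, r=1, e=0):
    """Operation-weight edit distance D(n, m) with indel cost d,
    replacement cost r, and match cost e."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * d  # delete the first i characters of S1
    for j in range(1, m + 1):
        D[0][j] = j * d  # insert the first j characters of S2
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = e if s1[i - 1] == s2[j - 1] else r
            D[i][j] = min(
                D[i][j - 1] + d,      # insert S2(j)
                D[i - 1][j] + d,      # delete S1(i)
                D[i - 1][j - 1] + t,  # match or replace
            )
    return D[n][m]


print(operation_weight_distance("a", "b"))            # prints 1
print(operation_weight_distance("a", "b", d=1, r=3))  # prints 2
```

The second call shows the r > 2d case: the replacement (cost 3) loses to a deletion plus an insertion (cost 2).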

9
Weight Edit Distance
  • The alternative to operation-weight edit distance
    is alphabet-weight edit distance.
  • Idea: different characters have different costs.
  • Q: How would we modify the edit distance
    function, D(i,j), to support alphabet-weight edit
    distance?
  • A: Let weight(x) denote the weight associated
    with character x for all x in the alphabet.
  • Then D(i,0) = Σ weight(S1(k)), summed over 1 ≤ k ≤ i
  • And D(0,j) = Σ weight(S2(k)), summed over 1 ≤ k ≤ j
  • Q: What about the general recurrence D(i,j)?

10
Weight Edit Distance
  • A: D(i,j) = min[D(i,j-1) + weight(S2(j)),
    D(i-1,j) + weight(S1(i)), D(i-1,j-1) + t(i,j)]
  • Where t(i,j) = weight(S2(j)) if S1(i) ≠ S2(j),
    o/w 0.
  • Note: for proteins, edit distance usually refers
    to alphabet-weight edit distance.
  • As the text mentions, the weights are usually
    derived from the PAM matrices of Dayhoff or the
    BLOSUM matrices of Henikoff.
  • Edit distance for DNA strings is usually either
    unweighted or operation-weighted edit distance.
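
A small Python sketch of the alphabet-weight recurrence as stated on these slides. The function name and the per-character weights in the example are invented for illustration; they are not taken from PAM or BLOSUM, which score pairs of amino acids.

```python
def alphabet_weight_distance(s1, s2, weight):
    """Alphabet-weight edit distance, with weight[x] the cost attached
    to character x (the slide's weight(x))."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + weight[s1[i - 1]]  # running sum over S1
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + weight[s2[j - 1]]  # running sum over S2
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            x, y = s1[i - 1], s2[j - 1]
            t = weight[y] if x != y else 0
            D[i][j] = min(
                D[i][j - 1] + weight[y],  # insert y
                D[i - 1][j] + weight[x],  # delete x
                D[i - 1][j - 1] + t,      # match, or replace x by y
            )
    return D[n][m]


# Hypothetical weights, chosen only to make the arithmetic easy to follow.
toy_weight = {"a": 1, "c": 2, "g": 3, "t": 4}
print(alphabet_weight_distance("ac", "ag", toy_weight))  # prints 3
```

In the example the cheapest transcript matches a with a and replaces c by g at cost weight(g) = 3.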

11
String Similarity
  • The relatedness of two strings can be expressed
    in terms of similarity.
  • This similarity is usually expressed in terms of
    alignment rather than in terms of edit distance.
  • Defn. Let Σ be the alphabet for strings S1 and
    S2. Let Σ' be Σ with the additional character "-"
    denoting a space. Let s(x,y) denote the value
    obtained by aligning character x with character y.

12
String Similarity
  • Defn. The value of alignment A is defined as
    Σ s(S1'(i), S2'(i)), summed over 1 ≤ i ≤ l,

Where S1' and S2' denote the strings after the
insertion of spaces, and l denotes their length.
If s(x,y) is greater than or equal to zero when x
and y match and negative when they mismatch, then
we look for the alignment with the largest score.

13
String Similarity
  • Example: Σ = {a, g, c, t}. Let s(x,y) be defined
    by
    [scoring table not preserved in this transcript]

Q: What is the value of the following alignment?
    a t a - a c t g t
    g t a g a c - g t

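
Since the slide's scoring table did not survive in the transcript, the question cannot be answered with the original numbers. The Python sketch below evaluates the same alignment under an assumed scheme (+2 match, -1 mismatch, -1 space), read as two rows of nine columns. Both function names and the scoring values are illustrative assumptions.

```python
def alignment_value(row1, row2, s):
    """Sum s(x, y) over the columns of an alignment given as two rows."""
    if len(row1) != len(row2):
        raise ValueError("both rows of an alignment must have the same length")
    return sum(s(x, y) for x, y in zip(row1, row2))


def assumed_score(x, y):
    # Assumed scheme, NOT the slide's table: +2 match, -1 mismatch, -1 space.
    if x == "-" or y == "-":
        return -1
    return 2 if x == y else -1


# 6 matches, 1 mismatch, and 2 spaces: 12 - 1 - 2
print(alignment_value("ata-actgt", "gtagac-gt", assumed_score))  # prints 9
```
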
14
String Similarity
  • Defn. Given a scoring matrix over Σ', define the
    similarity of two strings S1 and S2 as the
    value of the alignment A that maximizes the total
    alignment value of S1 and S2.
  • This also defines the optimal alignment value of
    the strings S1 and S2.

15
Computing Similarity
  • Q: How can we compute the optimal alignment value
    of the strings S1 and S2?
  • A: Use dynamic programming.
  • Defn. Let V(i,j) denote the value of the optimal
    alignment of prefixes S1[1..i] and S2[1..j].
  • If strings S1 and S2 have lengths n and m,
    respectively, then the value of the optimal
    alignment of these strings is given by V(n,m).
  • Q: What do you guess the time complexity will be?
  • A: O(nm)

16
Computing Similarity
The optimal alignment value relation is defined
similarly to the edit distance relation.
  • Base conditions:
  • V(0,j) = Σ s(_, S2(k)), summed over 1 ≤ k ≤ j
  • V(i,0) = Σ s(S1(k), _), summed over 1 ≤ k ≤ i
  • Define the general recurrence relation as:
  • V(i,j) = max[V(i-1,j-1) + s(S1(i),S2(j)),
    V(i-1,j) + s(S1(i),_), V(i,j-1) + s(_,S2(j))]

17
Computing Similarity
  • V(i,j) = max[V(i-1,j-1) + s(S1(i),S2(j)),
    V(i-1,j) + s(S1(i),_), V(i,j-1) + s(_,S2(j))]
  • Q: What does this recurrence relation say?
  • A: The optimal alignment of the prefixes S1[1..i]
    and S2[1..j] is the maximum of:
  • The optimal alignment of S1[1..i-1] and
    S2[1..j-1] extended by aligning S1(i) and S2(j).
  • The optimal alignment of S1[1..i-1] and S2[1..j]
    extended by aligning S1(i) with a space.
  • The optimal alignment of S1[1..i] and S2[1..j-1]
    extended by aligning a space with S2(j).
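
The three cases translate directly into a table fill. The Python sketch below is one way to write it; the names are illustrative, and the base conditions V(i,0) and V(0,j) are assumed to be the accumulated scores of aligning a prefix against spaces only. The scoring function (+2 match, -1 mismatch or space) is also an assumption made for the example.

```python
def global_alignment_value(s1, s2, s, space="_"):
    """Value V(n, m) of an optimal global alignment under scoring function s."""
    n, m = len(s1), len(s2)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + s(s1[i - 1], space)  # S1 prefix vs. spaces
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + s(space, s2[j - 1])  # spaces vs. S2 prefix
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(
                V[i - 1][j - 1] + s(s1[i - 1], s2[j - 1]),  # align S1(i), S2(j)
                V[i - 1][j] + s(s1[i - 1], space),          # S1(i) with a space
                V[i][j - 1] + s(space, s2[j - 1]),          # a space with S2(j)
            )
    return V[n][m]


def simple_score(x, y):
    # Illustrative scoring only: +2 match, -1 mismatch, -1 space.
    if x == "_" or y == "_":
        return -1
    return 2 if x == y else -1


print(global_alignment_value("ac", "abc", simple_score))  # prints 3
```

The two nested loops touch each of the (n+1)(m+1) cells once, which is where the O(nm) bound comes from.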

18
Longest Common Subsequence
  • Defn. A subsequence of a string S is a subset of
    its characters arranged in their original relative
    order.
  • Example:
  • S = interdepartmentaladministratorstaskforce
  • subsequence => idiots
  • InterDepartmentaladmInistratOrsTaSkforce
    (one choice of subsequence letters, capitalized)
  • Obviously every substring of S is also a
    subsequence of S.
  • Defn. A common subsequence of two strings is a
    subsequence that appears in both strings.

19
Longest Common Subsequence
  • Defn. The longest common subsequence problem
    entails finding the longest common subsequence
    (lcs) of two strings.
  • Thm. The matched characters of an optimal
    alignment A form a longest common subsequence, if
    a scoring scheme is used in which each matching
    pair of characters scores 1 and each mismatch or
    space scores 0.
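
The theorem suggests computing an lcs with the alignment recurrence under the 1/0 scoring scheme and then reading the matches off a traceback. A minimal Python sketch (function name chosen here, not from the slides):

```python
def longest_common_subsequence(s1, s2):
    """One longest common subsequence, via alignment with match = 1 and
    mismatch = space = 0."""
    n, m = len(s1), len(s2)
    V = [[0] * (m + 1) for _ in range(n + 1)]  # base conditions are all 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = 1 if s1[i - 1] == s2[j - 1] else 0
            V[i][j] = max(V[i - 1][j - 1] + match, V[i - 1][j], V[i][j - 1])
    # Trace back from (n, m), collecting the matched characters.
    out = []
    i, j = n, m
    while i > 0 and j > 0:
        if s1[i - 1] == s2[j - 1] and V[i][j] == V[i - 1][j - 1] + 1:
            out.append(s1[i - 1])
            i, j = i - 1, j - 1
        elif V[i][j] == V[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


print(longest_common_subsequence(
    "interdepartmentaladministratorstaskforce", "idiots"))  # prints idiots
```

When several longest common subsequences exist, the tie-breaking order in the traceback decides which one is returned.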

20
Alignment Graphs
  • Like distance, similarity can be viewed as a path
    problem. The graph that is analogous to the edit
    graph (section 11.4) is called an alignment
    graph.
  • Defn. An alignment graph is a DAG similar to an
    edit graph in which the edge weights correspond
    to the values for aligning specific character
    pairs.
  • The optimal alignment corresponds to the longest
    path, in terms of the sum of edge weights, from
    (0,0) to (n,m) of the dynamic programming table.
  • The longest paths (optimal alignments) can be
    found in O(nm) time.

21
End-Space Free Alignment
  • End-space free alignment: an alignment variant
    in which leading and trailing spaces contribute
    zero weight.
  • Example:
    e x a m p l e - h e c o u l d a - - - h a d a - - b e e r
    - - - - - - - - h e w o u l d n t a s h o t h i s d e a r
  • The first eight spaces are free.
  • This encourages (biases towards):
  • Alignment of one string inside the other, or
  • Alignment of the prefix of one string with the
    suffix of the other

22
End-Space Free Alignment
  • Q: When should interior or prefix/suffix matching
    be preferred?
  • A: When it matches the nature of the problem
    being modeled.
  • An example is shotgun sequence assembly. Explain!
  • Start with a large collection of partially
    overlapping substrings that come from multiple
    copies of one original, but unknown string.
  • Use comparisons of pairs of substrings to infer
    the original string.

23
End-Space Free Alignment
  • Q: Would you expect substrings that overlap in
    the original string to show significant
    alignment?
  • A: Perhaps. In any case, with some slop for
    sequencing errors, either
  • one string would align inside the other or
  • the prefix of one string would align with the
    suffix of the other
  • In contrast, a significant alignment of randomly
    selected substrings from this collection is
    unlikely.
  • An End-Space Free Alignment would detect this
    difference and score overlapping substrings
    higher.

24
End-Space Free Alignment
  • We can deduce candidate neighbor pairs by:
  • Computing End-Space Free Alignment for every pair
    of substrings.
  • High scoring alignments are likely neighbors.
  • To compute this:
  • Use a recurrence for global alignment where
    spaces count.
  • Change the definition of V(i,0), V(0,j) to
    address leading spaces: V(i,0) = V(0,j) = 0 for
    all i and j.
  • Compute the alignment graph in O(nm) time. How?

25
End-Space Free Alignment
  • Unlike global alignment, the value of the optimal
    alignment is not necessarily in cell (n,m).
  • The optimal alignment will now be found in:
  • A cell in row n, if the last character of S1
    contributes to the value of the alignment but the
    last characters of S2 do not.
  • A cell in column m, if the last character of S2
    contributes to the value of the alignment but the
    last characters of S1 do not.
  • The value of the optimal alignment is the largest
    value found in row n or column m.
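
Put together, end-space free alignment needs only two changes to the global alignment computation: zero base conditions, and a final scan of row n and column m. A minimal Python sketch, with illustrative names and an assumed scoring of +2 match, -1 mismatch or space:

```python
def end_space_free_value(s1, s2, s, space="_"):
    """Value of an optimal end-space free alignment of s1 and s2."""
    n, m = len(s1), len(s2)
    # V(i,0) = V(0,j) = 0: leading spaces are free.
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(
                V[i - 1][j - 1] + s(s1[i - 1], s2[j - 1]),
                V[i - 1][j] + s(s1[i - 1], space),
                V[i][j - 1] + s(space, s2[j - 1]),
            )
    # Trailing spaces are free: take the best cell in row n or column m.
    best_in_last_row = max(V[n])
    best_in_last_col = max(V[i][m] for i in range(n + 1))
    return max(best_in_last_row, best_in_last_col)


def simple_score(x, y):
    # Illustrative scoring only: +2 match, -1 mismatch, -1 space.
    if x == "_" or y == "_":
        return -1
    return 2 if x == y else -1


print(end_space_free_value("xxxabc", "abcyyy", simple_score))  # prints 6
print(end_space_free_value("abc", "xxabcxx", simple_score))    # prints 6
```

The first call is a suffix/prefix overlap and the second is one string inside the other, the two situations this variant is biased towards.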

26
  • And now for something completely different
  • Approximate Matching

27
Approximate Matching
  • Basic idea: threshold-defined similarity
  • Defn. A substring T' of T is an approximate
    occurrence of P if and only if the optimal
    alignment of P to T' has value at least δ, the
    threshold parameter.
  • Approach:
  • Use the standard recurrence for global alignment.
  • Do not charge for spaces preceding the occurrence
    in T: V(0,j) = 0 for all j. V(i,0) keeps its
    global alignment value, since every character of
    P must be accounted for.
  • Leave backpointers while computing the table
28
Approximate Matching
  • Q: How can we recognize an approximate occurrence
    of P in T from the table computation?
  • A: If the length of P is n, then for some j,
    V(n,j) ≥ δ
  • More specifically:
  • Thm. There is an approximate occurrence of P in T
    ending at position j of T if and only if
    V(n,j) ≥ δ
  • This tells us where in T the approximate
    occurrence ends. Where in T does it start?

29
Approximate Matching
  • Thm. (version 0) There is an approximate
    occurrence of P in T ending at position j of T if
    and only if V(n,j) ≥ δ
  • This tells us where in T the approximate
    occurrence ends. Where in T does it start?
  • We can find the start by following the path from
    cell (n,j) back to (0,k). k is the starting
    position in T.
  • Thm. (version 1) T[k..j] is an approximate
    occurrence of P in T if and only if V(n,j) ≥ δ
    and there is a path of backpointers from (n,j) to
    (0,k).

30
Approximate Matching
  • The table computation takes O(nm) time.
  • Consider: depending on the threshold δ, T may
    contain a great many approximate occurrences of
    P.
  • Q: Can all approximate occurrences be explicitly
    output in O(nm) time?
  • A: Perhaps not.
  • The textbook suggests locating all j s.t.
    V(n,j) ≥ δ and explicitly outputting a shortest
    approximate occurrence ending at each such j:
  • Traverse backpointers from (n,j) until reaching
    (0,k)
  • Choose vertical pointers over diagonal pointers
  • Choose diagonal pointers over horizontal pointers.

31
Approximate Matching
  • How does this particular preference produce a
    shortest approximate occurrence?
  • Choose vertical pointers over diagonal pointers
  • Choose diagonal pointers over horizontal
    pointers.
  • Recall:
  • Horizontal edges correspond to inserting a space
    in P; this lengthens the occurrence in T. Clearly
    this is to be avoided.
  • Diagonal edges correspond to matches or
    mismatches.
  • Vertical edges correspond to inserting a space in
    T.
  • There is no obvious reason for choosing vertical
    over diagonal edges, other than that some
    preference must be made for tie-breaking.
  • Except that a vertical edge consumes no character
    of T, so choosing vertical results in the match
    that is shortest in T.
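
The whole procedure (table, threshold test on row n, traceback with the vertical > diagonal > horizontal preference) fits in a short Python sketch. Some assumptions are made here: P indexes the rows and T the columns, row 0 is all zeros while column 0 accumulates the cost of aligning P against spaces, and backpointers are re-derived from the table instead of being stored. All names and the scoring function are illustrative.

```python
def approximate_occurrences(P, T, delta, s, space="_"):
    """Return (start, end, substring) for each end position j of T with
    V(n, j) >= delta, using 0-based half-open slices T[start:end]."""
    n, m = len(P), len(T)
    V = [[0] * (m + 1) for _ in range(n + 1)]  # row 0 stays 0: free start in T
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + s(P[i - 1], space)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(
                V[i - 1][j - 1] + s(P[i - 1], T[j - 1]),
                V[i - 1][j] + s(P[i - 1], space),
                V[i][j - 1] + s(space, T[j - 1]),
            )
    hits = []
    for j_end in range(1, m + 1):
        if V[n][j_end] < delta:
            continue
        i, j = n, j_end
        while i > 0:  # walk back until row 0 is reached
            if V[i][j] == V[i - 1][j] + s(P[i - 1], space):
                i -= 1                # vertical: preferred
            elif j > 0 and V[i][j] == V[i - 1][j - 1] + s(P[i - 1], T[j - 1]):
                i, j = i - 1, j - 1   # diagonal: second choice
            else:
                j -= 1                # horizontal: last resort
        hits.append((j, j_end, T[j:j_end]))
    return hits


def simple_score(x, y):
    # Illustrative scoring only: +2 match, -1 mismatch, -1 space.
    if x == "_" or y == "_":
        return -1
    return 2 if x == y else -1


print(approximate_occurrences("abc", "xxabcxx", 6, simple_score))
# prints [(2, 5, 'abc')]
```

Lowering delta admits more end positions, each reported with the occurrence that the preference order selects.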

32
  • Global Alignment vs Local Alignment

33
Local Alignment
  • So far we have focused on global alignment. This
    makes sense if:
  • We expect one string to be contained in the other,
    or
  • We expect the strings to be closely related.
  • Example: comparing amino acid sequences from the
    same protein family.

34
Local Alignment
  • Local alignment exposes regions of high
    similarity.
  • This may be interesting even if we expect the
    strings to be globally dissimilar.
  • Can you think of examples?
  • Comparing proteins from different protein
    families
  • How about searching for lateral gene transfer
    from prokaryotic genomes to eukaryotic genomes?
  • Huh????

35
Local Alignment
  • Local alignment problem: Find maximally similar
    (optimal global alignment) substrings α and β of
    S1 and S2, respectively.
  • Example from text: S1 = pqraxabcstvq, S2 =
    xyaxbacsll
  • α = a x a b - c s
  • β = a x - b a c s
  • This global alignment is predicated on:
  • a score of 2 for a match
  • a score of -2 for a mismatch
  • a score of -1 for a space
  • Resulting in a value of 8 (five matches and two
    spaces).

36
Computing Local Alignment
  • Q: How can local alignment be computed?
  • Q: Can global alignment be used to find local
    alignment?
  • A: Not efficiently. Global alignment effectively
    averages out local similarity.
  • Use explicit search for local similarity.

37
Computing Local Alignment
  • Q: Assuming S1 and S2 have respective lengths n
    and m, how many pairs of substrings are there?
  • A: There are O(n²m²) pairs of substrings.
  • Q: If we wanted to, how could we show there are
    this many pairs of substrings?
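
One way to answer: a non-empty substring is fixed by its start and end positions, so a string of length n has n(n+1)/2 of them, and the number of pairs is the product of the two counts. The Python sketch below (illustrative names) checks the closed form against a brute-force enumeration. It counts substrings by position, so equal strings at different positions are counted separately.

```python
def count_substring_pairs(n, m):
    """Pairs (substring of S1, substring of S2), counted by position."""
    return (n * (n + 1) // 2) * (m * (m + 1) // 2)


def brute_force_pair_count(s1, s2):
    """Enumerate every (start, end) choice explicitly and count the pairs."""
    spans1 = [(a, b) for a in range(len(s1)) for b in range(a + 1, len(s1) + 1)]
    spans2 = [(a, b) for a in range(len(s2)) for b in range(a + 1, len(s2) + 1)]
    return sum(1 for _ in spans1 for _ in spans2)


print(count_substring_pairs(3, 2))        # prints 18
print(brute_force_pair_count("abc", "xy"))  # prints 18
```

Since n(n+1)/2 grows as n² and m(m+1)/2 as m², the product is O(n²m²).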

38
Computing Local Alignment
  • Observation: Computing a global alignment for each
    of the O(n²m²) pairs of substrings, at O(nm) time
    per pair, would take O(n³m³) time in total.
  • Surprisingly, we can compute local alignment in
    O(nm) time even though there are O(n²m²) pairs of
    substrings.
  • Assumption: the global alignment of two empty
    strings has value zero.

39
Computing Local Alignment
  • First consider a restricted version of local
    alignment.
  • Defn. The local suffix alignment problem entails
    finding a suffix α of S1[1..i] and a suffix β of
    S2[1..j] s.t. V(α,β) is the maximum over all
    pairs of suffixes of S1[1..i] and S2[1..j].
  • Let v(i,j) denote the value of the optimal local
    suffix alignment for the index pair i,j.

40
Computing Local Alignment
  • Local suffix alignment example:
  • S1 = abcxdex, S2 = xxcxdeabc. Score 2 for matches
    and -1 for mismatches or spaces.
  • v(3,4) = 1, how?
  • The c's match but there is an additional "-"
    aligned with x.
  • v(4,4) = 4, how?
  • The c's match and the final x's match.
  • v(5,4) = 3, how?
  • Same as v(4,4) but extended with d aligned with
    "-".

41
Computing Local Alignment
  • Observation: v(i,j) ≥ 0.
  • Q: Why is this true?
  • A: We can always choose α and/or β to be the
    empty string.
  • Let v* denote the value of the optimal local
    alignment for strings of length n and m.
  • Thm. v* = max[v(i,j) : i ≤ n, j ≤ m]

42
Computing Local Alignment
  • We need to understand why this theorem,
    v* = max[v(i,j) : i ≤ n, j ≤ m], is true.
  • Proof (≥):
  • v* ≥ max[v(i,j) : i ≤ n, j ≤ m], since any
    optimal local suffix alignment is also a local
    alignment.

43
Computing Local Alignment
  • Proof (≤):
  • WLOG assume v* is derived from the optimal
    solution involving substrings α and β with end
    indices i and j. Then α and β define a local
    suffix alignment for indices i and j, thus v* ≤
    v(i,j) ≤ max[v(i,j) : i ≤ n, j ≤ m]
  • From this it is clear that a solution to the
    local suffix alignment problem also solves the
    local alignment problem.

44
Computing Local Alignment
  • Thm. v(i,j) = max[0, v(i-1,j-1) + s(S1(i),S2(j)),
    v(i-1,j) + s(S1(i),_), v(i,j-1) + s(_,S2(j))]
  • Where v(i,0) = 0 and v(0,j) = 0 for all i,j
  • Q: What does this recurrence say?
  • A: The solution to the local suffix alignment
    problem, v(i,j), is the largest of:
  • 0: punt and choose α and β to be empty strings
  • v(i-1,j-1) extended by aligning S1(i) and
    S2(j)
  • v(i-1,j) extended by aligning S1(i) with _
  • v(i,j-1) extended by aligning _ with S2(j)
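
The recurrence can be checked against the earlier example (S1 = abcxdex, S2 = xxcxdeabc, 2 for a match, -1 for a mismatch or space, reading the example's penalties as negative). A minimal Python sketch with illustrative names:

```python
def local_suffix_table(s1, s2, s, space="_"):
    """Table of v(i, j): the optimal local suffix alignment values."""
    n, m = len(s1), len(s2)
    v = [[0] * (m + 1) for _ in range(n + 1)]  # v(i,0) = v(0,j) = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            v[i][j] = max(
                0,                                          # empty suffixes
                v[i - 1][j - 1] + s(s1[i - 1], s2[j - 1]),  # align S1(i), S2(j)
                v[i - 1][j] + s(s1[i - 1], space),          # S1(i) with a space
                v[i][j - 1] + s(space, s2[j - 1]),          # a space with S2(j)
            )
    return v


def example_score(x, y):
    # Scoring assumed for the example: 2 for a match, -1 otherwise.
    if x == "_" or y == "_":
        return -1
    return 2 if x == y else -1


v = local_suffix_table("abcxdex", "xxcxdeabc", example_score)
print(v[3][4], v[4][4], v[5][4])  # prints 1 4 3
```

The three printed entries are v(3,4), v(4,4), and v(5,4).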

45
Computing Local Alignment
  • Q: What is the difference between the equations
    for global alignment and local suffix alignment?
  • A: There are two differences:
  • The inclusion of 0 in the local suffix alignment
    recurrence
  • The base conditions for local suffix alignment:
    v(i,0) = 0 and v(0,j) = 0 for all i,j. This is
    similar to finding approximate occurrences
    (where V(0,j) = 0), but not to general global
    alignment.

46
Computing Local Alignment
  • Approach to computing v*:
  • Compute the table for v(i, j).
  • Search the entire table for the largest value;
    let (i*, j*) denote the cell containing the
    largest value.
  • Follow backpointers from cell (i*, j*) to a cell
    (i', j') which has the value zero. This gives the
    optimal local alignment.
  • The local optimal alignment substrings are then α
    = S1[i'..i*] and β = S2[j'..j*]
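
The steps above can be sketched in Python as follows. The function name and default scores are illustrative; the defaults reproduce the earlier textbook example (2 for a match, -2 for a mismatch, -1 for a space). Backpointers are re-derived from the table during the traceback, and ties are broken in a fixed order chosen here.

```python
def local_alignment(s1, s2, match=2, mismatch=-2, space=-1):
    """Return (v*, alpha, beta): the optimal local alignment value and the
    substrings of s1 and s2 that achieve it."""
    n, m = len(s1), len(s2)
    v = [[0] * (m + 1) for _ in range(n + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            v[i][j] = max(0, v[i - 1][j - 1] + sub,
                          v[i - 1][j] + space, v[i][j - 1] + space)
            if v[i][j] > best:  # remember the cell (i*, j*)
                best, bi, bj = v[i][j], i, j
    # Trace back from (i*, j*) until a cell with value zero is reached.
    i, j = bi, bj
    while v[i][j] > 0:
        sub = match if s1[i - 1] == s2[j - 1] else mismatch
        if v[i][j] == v[i - 1][j - 1] + sub:
            i, j = i - 1, j - 1
        elif v[i][j] == v[i - 1][j] + space:
            i -= 1
        else:
            j -= 1
    # The zero cell lies just before the first aligned characters, so the
    # 0-based slices start at i and j.
    return best, s1[i:bi], s2[j:bj]


print(local_alignment("pqraxabcstvq", "xyaxbacsll"))
# prints (8, 'axabcs', 'axbacs')
```

On the textbook example this recovers the substrings axabcs and axbacs with value 8.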

47
Computing Local Alignment
  • Analysis of computing v*:
  • We know that computing the table to solve v*
    takes time O(nm).
  • The table contains all optimal local alignments
    for v(i, j). An alignment can be found by
    locating a cell with value v* and tracing back
    from it.