Bioinformatics Algorithms and Data Structures

Slides: 48

Provided by: john244

Learn more at: https://www.cse.sc.edu

Transcript and Presenter's Notes

1
Bioinformatics Algorithms and Data Structures

Chapter 11, sections 4-7
Lecturer: Dr. Rose
Slides by: Dr. Rose
February 4 & 6, 2003

2
Edit Graphs

Key idea: weighted edit graph
Defn. Given strings S1 and S2 of lengths n and m,
respectively, a weighted edit graph has (n+1) by
(m+1) nodes, labelled (i,j), 0 ≤ i ≤ n, 0 ≤ j ≤ m.
The edge weights are problem specific.

3
Edit Graphs

Example: edit distance problem
The weighted graph for the edit distance problem
has directed edges from node (i, j) to the nodes
(i+1, j), (i, j+1), and (i+1, j+1),
provided they exist.
The weight of the directed edges to nodes (i+1, j)
and (i, j+1) is 1.
The weight of the directed edge to (i+1, j+1)
is t(i+1, j+1), which is 0 if S1(i+1) = S2(j+1)
and 1 otherwise.
Figure 11.4 in the textbook shows an edit graph.

4
Edit Graphs

Thm. An edit transcript for strings S1 and S2 has
the minimum number of edit operations if and only
if it corresponds to a shortest path from (0,0) to
(n,m) in the edit graph.
Cor. The set of all shortest paths from (0,0) to
(n,m) in the edit graph specifies all optimal edit
transcripts of S1 to S2.
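
The theorem and corollary can be checked on small inputs. The following is a minimal Python sketch, not taken from the slides: it relaxes the edges of the unweighted edit graph in row-major order (a topological order of this DAG) and counts the shortest paths, i.e. the optimal edit transcripts. The function name edit_graph_shortest_paths is an assumption made for illustration.

```python
def edit_graph_shortest_paths(s1, s2):
    """Return (shortest path length, number of shortest paths) from
    node (0,0) to node (n,m) of the unweighted edit graph."""
    n, m = len(s1), len(s2)
    inf = float("inf")
    dist = [[inf] * (m + 1) for _ in range(n + 1)]
    count = [[0] * (m + 1) for _ in range(n + 1)]
    dist[0][0], count[0][0] = 0, 1

    # Row-major order visits every node after all of its predecessors.
    for i in range(n + 1):
        for j in range(m + 1):
            edges = []
            if i < n:
                edges.append((i + 1, j, 1))          # delete S1(i+1)
            if j < m:
                edges.append((i, j + 1, 1))          # insert S2(j+1)
            if i < n and j < m:
                t = 0 if s1[i] == s2[j] else 1       # match or replace
                edges.append((i + 1, j + 1, t))
            for a, b, w in edges:
                cand = dist[i][j] + w
                if cand < dist[a][b]:
                    dist[a][b], count[a][b] = cand, count[i][j]
                elif cand == dist[a][b]:
                    count[a][b] += count[i][j]
    return dist[n][m], count[n][m]


print(edit_graph_shortest_paths("ab", "ba"))  # (2, 3)
```

For S1 = ab and S2 = ba the edit distance is 2, and three shortest paths reach (n,m): replace both characters, delete-match-insert, or insert-match-delete.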

5
Weight Edit Distance

There are two ways of assigning weights (costs)
when calculating edit distance:
By edit operation
By alphabet, i.e., different costs for different
characters
Our initial approach was to assign weight by edit
operation, i.e., 1 for insert, delete, replace,
and 0 for match.
We can generalize our approach by assigning the
weight d for an insertion or deletion, r for a
replacement, and e for a match.

6
Weight Edit Distance

Q: What values for d, r, and e have we been
using?
A: d = 1, r = 1, and e = 0.
Q: What would happen if r > 2d?
A: Replacements would never occur, because a
deletion followed by an insertion costs only 2d.
Defn. The operation-weight distance problem
entails finding an edit transcript transforming
S1 to S2 with the minimum total operation weight.

7
Weight Edit Distance

Q: What changes should we make to the definition
of edit distance, D(i,j), to reflect operation
weight?
We have to specify an operation-specific
definition.
The base conditions become
D(i,0) = i × d. Why?
D(0,j) = j × d. Why?

8
Weight Edit Distance

The general recurrence becomes
D(i,j) = min{ D(i,j-1) + d, D(i-1,j) + d,
D(i-1,j-1) + t(i,j) }
where t(i,j) = e if S1(i) = S2(j), otherwise
t(i,j) = r.
Q: Why?
A: The cost of
Delete (from i-1,j) is d
Insert (from i,j-1) is d
Match (from i-1,j-1) is e
Replace (from i-1,j-1) is r
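
The base conditions and the recurrence translate directly into a table fill. The sketch below is one possible Python rendering, assuming the parameter names d, r, and e from the slides; the function name weighted_edit_distance is made up for illustration.

```python
def weighted_edit_distance(s1, s2, d=1, r=1, e=0):
    """Operation-weight edit distance: d per insertion or deletion,
    r per replacement, e per match."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]

    # Base conditions: only deletions (column 0) or insertions (row 0).
    for i in range(1, n + 1):
        D[i][0] = i * d
    for j in range(1, m + 1):
        D[0][j] = j * d

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = e if s1[i - 1] == s2[j - 1] else r
            D[i][j] = min(D[i][j - 1] + d,       # insert S2(j)
                          D[i - 1][j] + d,       # delete S1(i)
                          D[i - 1][j - 1] + t)   # match or replace
    return D[n][m]


print(weighted_edit_distance("kitten", "sitting"))            # 3
print(weighted_edit_distance("kitten", "sitting", d=1, r=3))  # 5
```

The second call uses r > 2d, so no replacement is ever chosen and the result equals the number of insertions and deletions alone.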

9
Weight Edit Distance

The alternative to operation-weight edit distance
is alphabet-weight edit distance.
Idea: different characters have different costs.
Q: How would we modify the edit distance
function, D(i,j), to support alphabet-weight edit
distance?
A: Let weight(x) denote the weight associated
with character x for all x in the alphabet.
Then D(i,0) = Σ_{k=1..i} weight(S1(k))
and D(0,j) = Σ_{k=1..j} weight(S2(k)).
Q: What about the general recurrence D(i,j)?

10
Weight Edit Distance

A: D(i,j) = min{ D(i,j-1) + weight(S2(j)),
D(i-1,j) + weight(S1(i)), D(i-1,j-1) + t(i,j) }
where t(i,j) = weight(S2(j)) if S1(i) ≠ S2(j),
otherwise 0.
Note: for proteins, edit distance usually refers
to alphabet-weight edit distance.
As the text mentions, the weights are usually
derived from the PAM matrices of Dayhoff or the
BLOSUM matrices of Henikoff.
Edit distance for DNA strings is usually either
unweighted or operation-weighted edit distance.
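
As a rough illustration of the alphabet-weight recurrence, the following Python sketch takes the per-character weights as a dictionary. The function name and the toy weights are assumptions; real protein weights would come from a PAM or BLOSUM matrix, which is not modeled here.

```python
def alphabet_weight_edit_distance(s1, s2, weight):
    """Alphabet-weight edit distance; weight maps each character to
    its cost (a hypothetical per-character table)."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]

    # Base conditions: sum of the weights of the characters consumed.
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + weight[s1[i - 1]]
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + weight[s2[j - 1]]

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # t(i,j) = weight(S2(j)) on a mismatch, 0 on a match.
            t = weight[s2[j - 1]] if s1[i - 1] != s2[j - 1] else 0
            D[i][j] = min(D[i][j - 1] + weight[s2[j - 1]],
                          D[i - 1][j] + weight[s1[i - 1]],
                          D[i - 1][j - 1] + t)
    return D[n][m]


toy_weights = {"a": 1, "b": 5}
print(alphabet_weight_edit_distance("a", "b", toy_weights))  # 5
```

With every weight set to 1 the function reduces to the unweighted edit distance.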

11
String Similarity

The relatedness of two strings can be expressed
in terms of similarity.
This similarity is usually expressed in terms of
alignment rather than in terms of edit distance.
Defn. Let Σ be the alphabet for strings S1 and
S2. Let Σ' be Σ with the additional character -
denoting a space. Let s(x,y) denote the value
obtained by aligning character x with character y.

12
String Similarity

Defn. The value of alignment A is defined as
Σ_{i=1..l} s(S1'(i), S2'(i))
where S1' and S2' denote the strings after the
insertion of spaces and l denotes their common
length.
If s(x,y) is greater than or equal to zero when x
and y match and negative when they mismatch, then
we look for the alignment with the largest score.

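
The value of a given alignment is just a column-by-column sum. The sketch below shows this in Python under an assumed scoring scheme (2 for a match, -2 for a mismatch, -1 for a space); the names simple_score and alignment_value are hypothetical, and any scoring function s(x,y) could be passed instead.

```python
SPACE = "-"


def simple_score(x, y):
    """Assumed scoring scheme: match 2, mismatch -2, space -1."""
    if x == SPACE or y == SPACE:
        return -1
    return 2 if x == y else -2


def alignment_value(row1, row2, score=simple_score):
    """Sum s(x, y) over the columns of an alignment given as two
    equal-length rows that already contain the inserted spaces."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    return sum(score(x, y) for x, y in zip(row1, row2))


print(alignment_value("axab-cs", "ax-bacs"))  # 8
```

The example has five matching columns and two columns containing a space, so its value is 5 × 2 - 2 = 8.
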
13
String Similarity

Example: Σ = {a, g, c, t}. Let s(x,y) be defined
by a scoring matrix (the table is not preserved in
this transcript).
Q: What is the value of the following alignment?
a t a - a c t g t
g t a g a c - g t

14
String Similarity

Defn. Given a scoring matrix over Σ', define the
similarity of two strings S1 and S2 as the
value of the alignment A that maximizes the total
alignment value of S1 and S2.
This also defines the optimal alignment value of
the strings S1 and S2.

15
Computing Similarity

Q: How can we compute the optimal alignment value
of the strings S1 and S2?
A: Use dynamic programming.
Defn. Let V(i,j) denote the value of the optimal
alignment of prefixes S1[1..i] and S2[1..j].
If strings S1 and S2 have lengths n and m,
respectively, then the value of the optimal
alignment of these strings is given by V(n,m).
Q: What do you guess the time complexity will be?
A: O(nm)

16
Computing Similarity
The optimal alignment value relation is defined
similarly to the edit distance relation.
Base conditions:
V(0,j) = Σ_{k=1..j} s(_, S2(k))
V(i,0) = Σ_{k=1..i} s(S1(k), _)
Define the general recurrence relation as
V(i,j) = max{ V(i-1, j-1) + s(S1(i), S2(j)),
V(i-1, j) + s(S1(i), _), V(i, j-1) + s(_, S2(j)) }

17
Computing Similarity

V(i,j) = max{ V(i-1, j-1) + s(S1(i), S2(j)),
V(i-1, j) + s(S1(i), _), V(i, j-1) + s(_, S2(j)) }
Q: What does this recurrence relation say?
A: The value of the optimal alignment of the
prefixes S1[1..i] and S2[1..j] is the maximum of
The optimal alignment of S1[1..i-1] and
S2[1..j-1] extended by aligning S1(i) and S2(j).
The optimal alignment of S1[1..i-1] and S2[1..j]
extended by aligning S1(i) with a space.
The optimal alignment of S1[1..i] and S2[1..j-1]
extended by aligning a space with S2(j).
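
The three cases above can be sketched as a table fill in Python. The scoring function simple_score (match 2, mismatch -2, space -1) and the name similarity are assumptions for illustration; the base conditions charge every space in row 0 and column 0, as global alignment requires.

```python
SPACE = "_"


def simple_score(x, y):
    """Assumed scoring scheme: match 2, mismatch -2, space -1."""
    if x == SPACE or y == SPACE:
        return -1
    return 2 if x == y else -2


def similarity(s1, s2, score=simple_score):
    """Value V(n,m) of an optimal global alignment of s1 and s2."""
    n, m = len(s1), len(s2)
    V = [[0] * (m + 1) for _ in range(n + 1)]

    # Base conditions: a prefix aligned entirely against spaces.
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + score(s1[i - 1], SPACE)
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + score(SPACE, s2[j - 1])

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(
                V[i - 1][j - 1] + score(s1[i - 1], s2[j - 1]),
                V[i - 1][j] + score(s1[i - 1], SPACE),
                V[i][j - 1] + score(SPACE, s2[j - 1]),
            )
    return V[n][m]


print(similarity("axabcs", "axbacs"))  # 8
```

Both loops touch each of the (n+1)(m+1) cells once and do constant work per cell, which is where the O(nm) bound comes from.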

18
Longest Common Subsequence

Defn. A subsequence of a string S is a subset of
the characters of S arranged in their original
relative order.
Example:
S = interdepartmentaladministratorstaskforce
subsequence: idiots (the letters i, d, i, o, t, s
appear in S in this order)
Obviously every substring of S is also a
subsequence of S.
Defn. A common subsequence of two strings is a
subsequence that appears in both strings.

19
Longest Common Subsequence

Defn. The longest common subsequence problem
entails finding a longest common subsequence (lcs)
of two strings.
Thm. The matched characters of an optimal
alignment A form a longest common subsequence, if
a scoring scheme is used in which each matching
pair of characters scores 1 and a mismatch or
space scores 0.
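
A small Python sketch of this theorem follows: it fills the similarity table with the 1/0 scoring scheme and then walks back through the table, collecting the matched characters. The function name lcs is chosen for illustration.

```python
def lcs(s1, s2):
    """Longest common subsequence via alignment with match = 1 and
    mismatch = space = 0, followed by a traceback."""
    n, m = len(s1), len(s2)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = 1 if s1[i - 1] == s2[j - 1] else 0
            V[i][j] = max(V[i - 1][j - 1] + match,
                          V[i - 1][j],
                          V[i][j - 1])

    # Trace back from (n,m); each scoring diagonal step is a match.
    out = []
    i, j = n, m
    while i > 0 and j > 0:
        if s1[i - 1] == s2[j - 1] and V[i][j] == V[i - 1][j - 1] + 1:
            out.append(s1[i - 1])
            i, j = i - 1, j - 1
        elif V[i][j] == V[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


print(lcs("interdepartmentaladministratorstaskforce", "idiots"))  # idiots
```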

20
Alignment Graphs

Like distance, similarity can be viewed as a path
problem. The graph that is analogous to the edit
graph (section 11.4) is called an alignment
graph.
Defn. An alignment graph is a DAG similar to an
edit graph in which the edge weights correspond
to the scores for aligning specific character pairs.
The optimal alignment corresponds to the longest
path, in terms of the sum of edge weights, from
(0,0) to (n,m) of the dynamic programming table.
The longest paths (optimal alignments) can be
found in O(nm) time.

21
End-Space Free Alignment

End-space free alignment: an alignment variant
in which leading and trailing spaces contribute
zero weight.
Example:
e x a m p l e - h e c o u l d a - - - h a d a - - b e e r
- - - - - - - - h e w o u l d n t a s h o t h i s d e a r
The first eight spaces are free.
This encourages (biases towards)
Alignment of one string inside the other, or
Alignment of the prefix of one string with the
suffix of the other

22
End-Space Free Alignment

Q: When should interior or prefix/suffix matching
be preferred?
A: When it matches the nature of the problem
being modeled.
An example is shotgun sequence assembly:
Start with a large collection of partially
overlapping substrings that come from multiple
copies of one original, but unknown, string.
Use comparisons of pairs of substrings to infer
the original string.

23
End-Space Free Alignment

Q: Would you expect substrings that overlap in
the original string to show significant
alignment?
A: Perhaps. In any case, with some slop for
sequencing errors, either
one string would align inside the other, or
the prefix of one string would align with the
suffix of the other.
In contrast, a significant alignment of randomly
selected substrings from this collection is
unlikely.
An end-space free alignment would detect this
difference and score overlapping substrings
higher.

24
End-Space Free Alignment

We can deduce candidate neighbor pairs by
computing the end-space free alignment for every
pair of substrings.
High-scoring alignments are likely neighbors.
To compute this:
Use the recurrence for global alignment, where
spaces count.
Change the definition of V(i,0), V(0,j) to
address leading spaces: V(i,0) = V(0,j) = 0 for
all i and j.
Compute the alignment table in O(nm) time. How?

25
End-Space Free Alignment

Unlike global alignment, the value of the optimal
alignment is not necessarily in cell (n,m).
The optimal alignment value will now be found in
A cell in row n, if the last character of S1
contributes to the value of the alignment but the
last characters of S2 do not.
A cell in column m, if the last character of S2
contributes to the value of the alignment but the
last characters of S1 do not.
The value of the optimal alignment is the largest
value among the cells in row n and column m.
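
Putting the two changes together (zero base conditions, and reading the answer from row n and column m) gives the following Python sketch. The scores (match 2, mismatch -2, space -1) and the function name are assumptions made for illustration.

```python
def end_space_free_value(s1, s2, match=2, mismatch=-2, space=-1):
    """Value of an optimal end-space free alignment of s1 and s2."""
    n, m = len(s1), len(s2)
    # Row 0 and column 0 stay at zero: leading spaces are free.
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if s1[i - 1] == s2[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1] + s,
                          V[i - 1][j] + space,
                          V[i][j - 1] + space)

    # Trailing spaces are free: take the best cell in the last row
    # or the last column instead of cell (n,m).
    last_row = max(V[n][j] for j in range(m + 1))
    last_col = max(V[i][m] for i in range(n + 1))
    return max(last_row, last_col)


print(end_space_free_value("xxabc", "abcyy"))  # 6
print(end_space_free_value("abc", "xxabcxx"))  # 6
```

The first call aligns a suffix of one string with a prefix of the other; the second aligns one string inside the other. In both cases only the three matching columns are charged.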

And now for something completely different
Approximate Matching

27
Approximate Matching

Basic idea: threshold-defined similarity
Defn. A substring T' of T is an approximate
occurrence of P if and only if the optimal
alignment of P to T' has value at least δ, the
threshold parameter.
Approach:
Use the standard recurrence for global alignment.
Do not charge preceding spaces: V(i,0) = V(0,j) =
0 for all i and j.
Leave backpointers while computing the table.

28
Approximate Matching

Q: How can we recognize an approximate occurrence
of P in T from the table computation?
A: If the length of P is n, then for some j,
V(n,j) ≥ δ.
More specifically
Thm. An approximate occurrence of P in T ends at
position j of T if and only if V(n,j) ≥ δ.
This tells us where in T the approximate
occurrence ends. Where in T does it start?

29
Approximate Matching

Thm. (version 0) An approximate occurrence of P
in T ends at position j of T if and only if
V(n,j) ≥ δ.
This tells us where in T the approximate
occurrence ends. Where in T does it start?
We can find the start by following the path from
cell (n,j) back to (0,k). k is the starting
position in T.
Thm. (version 1) T[k..j] is an approximate
occurrence of P in T if and only if V(n,j) ≥ δ
and there is a path of backpointers from (n,j) to
(0,k).

30
Approximate Matching

The table computation takes O(nm) time.
Consider: depending on the threshold δ, T may
contain a great many approximate occurrences of
P.
Q: Can all approximate occurrences be explicitly
output in O(nm) time?
A: Perhaps not.
The textbook suggests locating all j s.t.
V(n,j) ≥ δ and, for each such j, explicitly
outputting a shortest approximate occurrence:
Traverse backpointers from (n,j) until reaching
(0,k).
Choose vertical pointers over diagonal pointers.
Choose diagonal pointers over horizontal pointers.
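
The procedure can be sketched in Python as follows. Scores and names are assumptions for illustration. One choice made for this sketch departs from the base conditions as written on the slides: spaces opposite a prefix of P are still charged (V(i,0) = i × space), so that all of P is accounted for and every traceback reaches row 0. Backpointers are not stored; they are recomputed from the table during the traceback.

```python
def approximate_occurrences(p, t, threshold,
                            match=2, mismatch=-1, space=-1):
    """Return (k, j, value) for every end position j with
    V(n,j) >= threshold; t[k:j] is a shortest approximate
    occurrence of p ending there (Python slice indices)."""
    n, m = len(p), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + space      # assumption: column 0 is charged
    # Row 0 stays at zero: an occurrence may start anywhere in t.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if p[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1] + s,
                          V[i - 1][j] + space,
                          V[i][j - 1] + space)

    hits = []
    for j in range(m + 1):
        if V[n][j] < threshold:
            continue
        i, k = n, j
        while i > 0:
            # Preference order: vertical, then diagonal, then horizontal.
            if V[i][k] == V[i - 1][k] + space:
                i -= 1
            elif k > 0 and V[i][k] == V[i - 1][k - 1] + (
                    match if p[i - 1] == t[k - 1] else mismatch):
                i, k = i - 1, k - 1
            else:
                k -= 1
        hits.append((k, j, V[n][j]))
    return hits


print(approximate_occurrences("abc", "xxabcxx", threshold=6))  # [(2, 5, 6)]
```

Each traceback takes O(n + m) steps, so reporting one shortest occurrence per qualifying end position adds at most O(m(n + m)) time on top of the O(nm) table.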

31
Approximate Matching

How does this particular preference produce a
shortest path?
Choose vertical pointers over diagonal pointers.
Choose diagonal pointers over horizontal
pointers.
Recall:
Horizontal edges correspond to inserting a space
in P; this lengthens the path. Clearly this is to
be avoided.
Diagonal edges correspond to matches or
mismatches.
Vertical edges correspond to inserting a space in
T.
There is no obvious reason for preferring one of
diagonal and vertical edges over the other;
however, some preference must be made for
tie-breaking.
Except that choosing vertical results in a match
that is shortest in T.

Global Alignment vs Local Alignment

33
Local Alignment

So far we have focused on global alignment. This
makes sense if
We expect one string to be contained in the other
or
We expect the strings to be closely related.
Example: comparing amino acid sequences from the
same protein family.

34
Local Alignment

Local alignment exposes regions of high
similarity.
This may be interesting even if we expect the
strings to be globally dissimilar.
Can you think of examples?
Comparing proteins from different protein
families
How about searching for lateral gene transfer
from prokaryotic genomes to eukaryotic genomes?
Huh????

35
Local Alignment

Local alignment problem: find substrings a and b
of S1 and S2, respectively, whose optimal global
alignment value is maximum over all pairs of
substrings.
Example from text: S1 = pqraxabcstvq, S2 =
xyaxbacsll
a = a x a b - c s
b = a x - b a c s
This global alignment is predicated on
a score of 2 for a match
a score of -2 for a mismatch
a score of -1 for a space
resulting in a value of 8 (five matches and two
spaces).

36
Computing Local Alignment

Q: How can local alignment be computed?
Q: Can global alignment be used to find local
alignment?
A: Not efficiently. Global alignment effectively
averages out local similarity.
Use explicit search for local similarity.

37
Computing Local Alignment

Q: Assuming S1 and S2 have respective lengths n
and m, how many pairs of substrings are there?
A: There are O(n²m²) pairs of substrings.
Q: If we wanted to, how could we show there are
this many substrings?
38
Computing Local Alignment

Observation: computing a global alignment for each
of the O(n²m²) pairs of substrings takes far more
than O(nm) time.
Surprisingly, we can compute local alignment in
O(nm) time even though there are O(n²m²) pairs of
substrings.
Assumption: the global alignment of two empty
strings has value zero.

39
Computing Local Alignment

First consider a restricted version of local
alignment.
Defn. The local suffix alignment problem entails
finding a suffix a of S1[1..i] and a suffix b of
S2[1..j] s.t. V(a,b) is the maximum over all
pairs of suffixes of S1[1..i] and S2[1..j].
Let v(i,j) denote the value of the optimal suffix
alignment for the index pair i,j.

40
Computing Local Alignment

Local suffix alignment example
S1 = abcxdex, S2 = xxcxdeabc. Score 2 for matches
and -1 for mismatches or spaces.
v(3,4) = 1, how?
The c's match but there is an additional space
aligned with x.
v(4,4) = 4, how?
The c's match and the final x's match.
v(5,4) = 3, how?
Same as v(4,4) but extended with d aligned with
a space.

41
Computing Local Alignment

Observation: v(i,j) ≥ 0.
Q: Why is this true?
A: We can always choose a and/or b to be the
empty string.
Let v* denote the value of the optimal local
alignment for strings of length n and m.
Thm. v* = max{ v(i,j) : i ≤ n, j ≤ m }

42
Computing Local Alignment

We need to understand why this theorem,
v* = max{ v(i,j) : i ≤ n, j ≤ m }, is true.
Proof (≥):
v* ≥ max{ v(i,j) : i ≤ n, j ≤ m } since any
optimal local suffix alignment is also a local
alignment.

43
Computing Local Alignment

Proof (≤):
WLOG assume v* is derived from the optimal
solution involving substrings a and b with end
indices i and j. Then a and b define a local
suffix alignment for indices i and j, thus v* ≤
v(i,j) ≤ max{ v(i,j) : i ≤ n, j ≤ m }.
From this it is clear that a solution to the
local suffix alignment problem also solves the
local alignment problem.

44
Computing Local Alignment

Thm. v(i,j) = max{ 0, v(i-1, j-1) + s(S1(i), S2(j)),
v(i-1, j) + s(S1(i), _), v(i, j-1) + s(_, S2(j)) }
where v(i, 0) = 0 and v(0, j) = 0 for all i, j.
Q: What does this recurrence say?
A: The solution to the local suffix alignment
problem v(i, j) is the largest of
0: punt and choose a and b to be empty strings
v(i-1, j-1) extended by aligning S1(i) and
S2(j)
v(i-1, j) extended by aligning S1(i) with _
v(i, j-1) extended by aligning _ with S2(j)

45
Computing Local Alignment

Q: What is the difference between the equations
for global alignment and local suffix alignment?
A: There are two differences:
The inclusion of 0 in the local suffix
alignment recurrence.
The base conditions for local suffix alignment:
v(i,0) = 0 and v(0,j) = 0 for all i, j. This is
similar to finding approximate occurrences but
not to general global alignment.

46
Computing Local Alignment

Approach to computing v*
Compute the table for v(i, j).
Search the entire table for the largest value;
let (i*, j*) denote the cell containing the
largest value.
Follow backpointers from cell (i*, j*) to a cell
(i', j') which has the value zero. This gives the
optimal local alignment.
The optimal local alignment substrings are then a
= S1[i'..i*] and b = S2[j'..j*].
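
The whole approach fits in a short Python sketch. Names and default scores (match 2, mismatch -2, space -1) are assumptions for illustration; backpointers are recomputed from the table during the traceback rather than stored, and the substrings are returned as Python slices that begin just after the zero cell.

```python
def local_alignment(s1, s2, match=2, mismatch=-2, space=-1):
    """Return (v*, a, b, v): the optimal local alignment value, the
    two substrings, and the local suffix alignment table v."""
    n, m = len(s1), len(s2)
    v = [[0] * (m + 1) for _ in range(n + 1)]   # v(i,0) = v(0,j) = 0
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if s1[i - 1] == s2[j - 1] else mismatch
            v[i][j] = max(0,
                          v[i - 1][j - 1] + s,
                          v[i - 1][j] + space,
                          v[i][j - 1] + space)
            if v[i][j] > best:
                best, bi, bj = v[i][j], i, j

    # Trace back from the largest cell until a zero cell is reached.
    i, j = bi, bj
    while v[i][j] > 0:
        s = match if s1[i - 1] == s2[j - 1] else mismatch
        if v[i][j] == v[i - 1][j - 1] + s:
            i, j = i - 1, j - 1
        elif v[i][j] == v[i - 1][j] + space:
            i -= 1
        else:
            j -= 1
    return best, s1[i:bi], s2[j:bj], v


value, a, b, _ = local_alignment("pqraxabcstvq", "xyaxbacsll")
print(value, a, b)  # 8 axabcs axbacs
```

With match 2 and mismatch and space both -1, the same function reproduces the local suffix alignment values for S1 = abcxdex and S2 = xxcxdeabc: v(3,4) = 1, v(4,4) = 4, and v(5,4) = 3.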

47
Computing Local Alignment

Analysis of computing v*
We know that computing the table to solve for v*
takes O(nm) time.
The table contains all optimal local alignments.
An alignment can be found by locating a cell with
value v* and tracing back from it.
