Title: Sequence Alignment
1Sequence Alignment
2Outline
- Applying Manhattan Tourist Problem to sequence
comparison - Global Alignment
- Scoring Matrices
- Local Alignment
- Alignment with Affine Gap Penalties
3Align Two Strings
- Given the strings of DNA
- v ATGTTAT
- w ATCGTAC
- One Possible Alignment of the strings
- AT_GTTAT_
- ATCGT_A_C
4Align Two Strings (contd)
Each of the two rows of the alignment is
represented by a string of letters with space
symbols, -
5Align Two Strings (contd)
Another way to represent each row shows the
number of symbols of the sequence present up to a
given position. For example the above sequences
can be represented as
0 1 2 2 3 4 5 6 7 7
AT_GTTAT_ ATCGT_A_C
0 1 2 3 4 5 5 6 6 7
6Alignment Matrix
Both rows of the alignment can be represented in
the resulting matrix
0 1 2 2 3 4 5 6 7 7
0 1 2 3 4 5 5 6 6 7
Each column in this matrix is a coordinate in a
two-dimensional nxm grid
7Alignment as a Path in the Edit Graph
0 1 2 2 3 4 5 6 7 7 A T _ G T T A T _ A T C G
T _ A _ C 0 1 2 3 4 5 5 6 6 7 (0,0) , (1,1)
8Alignment as a Path in the Edit Graph
1
0
2
3
4
5
6
7
0 1 2 2 3 4 5 6 7 7 A T _ G T T A T _ A T C G
T _ A _ C 0 1 2 3 4 5 5 6 6 7 (0,0) , (1,1) ,
(2,2)
0
1
2
3
4
5
6
7
9Alignment as a Path in the Edit Graph
1
0
2
3
4
5
6
7
0 1 2 2 3 4 5 6 7 7 A T _ G T T A T _ A T C G
T _ A _ C 0 1 2 3 4 5 5 6 6 7 (0,0) , (1,1) ,
(2,2), (2,3), (3,4)
0
1
2
3
4
5
6
7
10Alignment as a Path in the Edit Graph
0 1 2 2 3 4 5 6 7 7 A T _ G T T A T _ A T C G
T _ A _ C 0 1 2 3 4 5 5 6 6 7 (0,0) , (1,1) ,
(2,2), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6),
(7,7)
- End Result -
11Alignments in Edit Graph (contd)
- and represent indels in v and w
- Score 0.
- represent exact matches.
- Score 1.
12Alignments in Edit Graph (contd)
The score of the alignment path in the graph is
5.
13Alignment as a Path in the Edit Graph
Every path in the edit graph corresponds to an
alignment
14Alignment as a Path in the Edit Graph
Old Alignment 0122345677 v AT_GTTAT_ w
ATCGT_A_C 0123455667
New Alignment 0122345677 v AT_GTTAT_ w
ATCG_TA_C 0123445667
15Alignment as a Path in the Edit Graph
0122345677 v AT_GTTAT_ w ATCGT_A_C
0123455667 (0,0) , (1,1) , (2,2), (2,3),
(3,4), (4,5), (5,5), (6,6), (7,6), (7,7)
16Alignment Dynamic Programming
17Dynamic Programming Example
- There are no matches in the beginning of the
sequence - Label column i1 to be all zero, and row j1 to
be all zero
0
0
0
0
0
0
0
0
18Dynamic Programming Example
0
0
0
0
0
0
0
0
1
1
1
1
1
1
1
?0 1, if vi wj ? value from top ? value from
left
1
1
1
1
1
1
19Alignment Backtracking
- Arrows show where the score
originated from. - if from the top
- if from the left
- if vi wj
20Backtracking Example
Find a match in row and column 2. i2, j2,5 is
a match (T). j2, i4,5,7 is
a match (T). Since vi wj, S(i,j) Si-1,j-1
1 S(2,2) S(1,1) 1 1 S(2,5) S(1,4)
1 1 S(4,2) S(3,1) 1 1 S(5,2) S(4,1)
1 1 S(7,2) S(6,1) 1 1
0
0
0
0
0
0
0
0
1
1
1
1
1
1
1
1
2
2
2
2
2
2
1
2
1
2
1
2
1
2
1
2
21Backtracking Example
0
0
0
0
0
0
0
0
Continuing with the scoring algorithm gives this
result.
1
1
1
1
1
1
1
1
2
2
2
2
2
2
1
2
2
3
3
3
3
1
2
2
3
4
4
4
1
2
2
3
4
4
4
1
2
2
3
4
5
5
1
2
2
3
4
5
5
22The LCS Problem
- The previous example was a solution to the
Longest Common Subsequence (LCS) problemthe
simplest form of a sequence similarity analysis. - To solve the alignment we eliminate mismatches
and allow only insertions and deletions.
23The LCS Problem (contd)
- Find the longest subsequence common to two
strings. - Input Two strings, v and w.
- Output The longest common subsequence of v
and w.
24The LCS Recurrence
- The score for vertex si,j is the same as in the
previous example
Si-1, j-11 if vi
wj Si,j max Si-1, j
Si, j-1
25The LCS Recurrence Revisited
- This can be rewritten by adding zero to the edges
that come from an indel, since the penalty of
indels are 0
Si-1, j-11 if vi
wj Si,j max Si-1, j 0
Si, j-1 0
26LCS Algorithm
- LCS(v,w)
- for i ? 1 to n
- Si,0 ? 0
- for j ? 1 to m
- S0,j ? 0
- for i ? 1 to n
- for j ? 1 to m
- si-1,j
- si,j ? max si,j-1
- si-1,j-1 1, if vi wj
- if si,j si-1,j
- bi,j ? if si,j si,j-1
- if si,j
si-1,j-1 1 - return (sn,m, b)
27Now What?
- LCS(v,w) created the alignment grid
- Now we need a way to read the best alignment of v
and w - Follow the arrows backwards from sink
28Printing the LCS
- PrintLCS(b,v,i,j)
- if i 0 or j 0
- return
- if bi,j
- PrintLCS(b,v,i-1,j-1)
- print vi
- else
- if bi,j
- PrintLCS(b,v,i-1,j)
- else
- PrintLCS(b,v,i,j-1)
29LCS Runtime
- To create the nxm matrix of best scores from
vertex (0,0) to all other vertices, it takes
O(nm) amount of time. - Why O(nm)? The pseudocode consists of a nested
for loop inside of another for loop to set up
a nxm matrix. - This sets up a value wj for every value vi.
30Change up the Scoring
- In the LCS Problem, we scored 1 from matches and
0 for indels in the alignment - Consider penalizing indels and mismatches with
negative scores
31The Global Alignment Problem
- Find the best alignment between two strings under
a given scoring matrix - Input Strings v w and a scoring matrix
- Output Alignment of maximum score
- ?? -?
- 1 if matching
- -µ if mismatching
- si-1,j-1 1 if vi wj
- si,j max s i-1,j-1 -µ if vi ? wj
- s i-1,j - d
- s i,j-1 - d
m mismatch d indel
32Scoring Matrices
- To generalize scoring, consider a (41) x(41)
scoring matrix d. - In the case of an amino acid alignment, the
scoring matrix would be a (201)x(201) size.
The addition of 1 is to include the score with
comparison of a gap character -. - This will simplify the scoring algorithm as
follows - si-1,j-1 d (vi, wj)
- si,j max s i-1,j d (vi, -)
- s i,j-1 d (-, wj)
33Simple Scoring
- When mismatches are penalized by some constant
µ, indels are penalized by some other constant
s, and matches are rewarded with 1, the
resulting score is - matches µ(mismatches) s (indels)
34Making a Scoring Matrix
- Scoring matrices are created based on biological
evidence. - Alignments can be thought of as two sequences
that differ due to mutations in the sequence. - Some of these mutations have little effect on the
organisms function, therefore some penalties,
d(vi , wj), will be less harsh than others.
35Scoring Matrix Example
- Notice that although R and K are different amino
acids, they have a positive score. - Why? They are both positively charged amino
acids? will not greatly change function of
protein.
36The Blosum50 Scoring Matrix
37Local vs. Global Alignment
- The Global Alignment Problem tries to find the
longest path between vertices (0,0) and (n,m) in
the edit graph. - The Local Alignment Problem tries to find the
longest path among paths between arbitrary
vertices (i,j) and (i, j) in the edit graph.
38Local vs. Global Alignment (contd)
- Global Alignment
- Local Alignmentbetter alignment to find
conserved segment
--T-CC-C-AGT-TATGT-CAGGGGACACGA-GCATGCAGA-G
AC
AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATGT-CAGAT-
-C
tccCAGTTATGTCAGgggacacgagcatgcagag
ac
aattgccgccgtcgttttcagCAGTTATGTCAGatc
39Local Alignment Example
Local alignment
Global alignment
40Local Alignments Why?
- Some genes only have small conserved regions
between species of organisms - Example
- Homeobox genes have a short region called the
homeodomain that is highly conserved between
species. - A global alignment would not find the homeodomain
because it would try to align the ENTIRE sequence
41The Local Alignment Problem
- Goal To find the best local alignment of strings
v and w - Input Strings v, w and scoring matrix d
- Output alignment of substrings of v w whose
alignment score is maximum among all possible
alignment of all possible substrings
42The Problem with this Problem
- Problem of this, long run time O(n4)
- - There are n2 pairs of vertices (i,j)
- - For each pair of vertices computing an
alignment takes O(n2) time. - This can be remedied by giving free rides
43Local Alignment Free Rides
Yeah, a free ride!
Vertex (0,0)
The dashed edges represent the free rides from
(0,0) to every other node.
44The Local Alignment Recurrence
- The largest value of si,j over the whole edit
graph is the score of the best local alignment. - The recurrence is shown below
0 si,j si-1,j-1 d
(vi, wj) max s i-1,j d (vi, -)
s i,j-1 d (-, wj)
45Affine Gap Penalties
- In nature, many times indels come as a unit, not
just at 1 nucleotide at a time.
ATA__GC ATATTGC
ATAG_GC AT_GTGC
Normal scoring would give the same score for both
alignments
46Accounting for Gaps
- Gaps- contiguous sequence of spaces in one of the
rows - Score for a gap of length x is -(? sx), where
? gt0 is the penalty for introducing a gap. ? will
be large relative to s because you do not want to
add too much of a penalty for extending the gap.
47Affine Gap Penalties and 3 Layer Manhattan Grid
- The three recurrences for the scoring algorithm
creates a 3-tiered graph. - The top level creates/extends gaps in the
sequence w. - The bottom level creates/extends gaps in sequence
v. - The middle level extends matches and mismatches.
48The 3 Grids
49The 3-leveled Manhattan Grid
Gaps in w
Matches/Mismatches
Gaps in v
50Affine Gap Penalty Recurrences
Continue Gap in w (deletion)
si,j s i-1,j - s max s
i-1,j (?s) si,j s i,j-1 - s
max s i,j-1 (?s) si,j
si-1,j-1 d (vi, wj) max s i,j
s i,j
Start Gap in w (deletion) from middle
Continue Gap in v (insertion)
Start Gap in v (insertion)from middle
Match or Mismatch
End deletion from top
End insertion from bottom