Sequence Alignment - PowerPoint PPT Presentation

1 / 50

About This Presentation

Title:

Sequence Alignment

Description:

Each of the two rows of the alignment is represented by a string of letters with ... Another way to represent each row shows the number of symbols of the sequence ... – PowerPoint PPT presentation

Number of Views:20

Avg rating:3.0/5.0

Slides: 51

Provided by: mch121

Category:

more less

Transcript and Presenter's Notes

Title: Sequence Alignment

1
Sequence Alignment
2
Outline

Applying Manhattan Tourist Problem to sequence
comparison
Global Alignment
Scoring Matrices
Local Alignment
Alignment with Affine Gap Penalties

3
Align Two Strings

Given the strings of DNA
v ATGTTAT
w ATCGTAC
One Possible Alignment of the strings
AT_GTTAT_
ATCGT_A_C

4
Align Two Strings (contd)
Each of the two rows of the alignment is
represented by a string of letters with space
symbols, -

AT_GTTAT_
ATCGT_A_C

5
Align Two Strings (contd)
Another way to represent each row shows the
number of symbols of the sequence present up to a
given position. For example the above sequences
can be represented as

0 1 2 2 3 4 5 6 7 7
AT_GTTAT_ ATCGT_A_C
0 1 2 3 4 5 5 6 6 7
6
Alignment Matrix
Both rows of the alignment can be represented in
the resulting matrix
0 1 2 2 3 4 5 6 7 7
0 1 2 3 4 5 5 6 6 7
Each column in this matrix is a coordinate in a
two-dimensional nxm grid
7
Alignment as a Path in the Edit Graph
0 1 2 2 3 4 5 6 7 7 A T _ G T T A T _ A T C G
T _ A _ C 0 1 2 3 4 5 5 6 6 7 (0,0) , (1,1)
8
Alignment as a Path in the Edit Graph
1
0
2
3
4
5
6
7
0 1 2 2 3 4 5 6 7 7 A T _ G T T A T _ A T C G
T _ A _ C 0 1 2 3 4 5 5 6 6 7 (0,0) , (1,1) ,
(2,2)
0
1
2
3
4
5
6
7
9
Alignment as a Path in the Edit Graph
1
0
2
3
4
5
6
7
0 1 2 2 3 4 5 6 7 7 A T _ G T T A T _ A T C G
T _ A _ C 0 1 2 3 4 5 5 6 6 7 (0,0) , (1,1) ,
(2,2), (2,3), (3,4)
0
1
2
3
4
5
6
7
10
Alignment as a Path in the Edit Graph
0 1 2 2 3 4 5 6 7 7 A T _ G T T A T _ A T C G
T _ A _ C 0 1 2 3 4 5 5 6 6 7 (0,0) , (1,1) ,
(2,2), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6),
(7,7)
- End Result -
11
Alignments in Edit Graph (contd)

and represent indels in v and w
Score 0.
represent exact matches.
Score 1.

12
Alignments in Edit Graph (contd)
The score of the alignment path in the graph is
5.
13
Alignment as a Path in the Edit Graph
Every path in the edit graph corresponds to an
alignment
14
Alignment as a Path in the Edit Graph
Old Alignment 0122345677 v AT_GTTAT_ w
ATCGT_A_C 0123455667
New Alignment 0122345677 v AT_GTTAT_ w
ATCG_TA_C 0123445667
15
Alignment as a Path in the Edit Graph
0122345677 v AT_GTTAT_ w ATCGT_A_C
0123455667 (0,0) , (1,1) , (2,2), (2,3),
(3,4), (4,5), (5,5), (6,6), (7,6), (7,7)
16
Alignment Dynamic Programming
17
Dynamic Programming Example

There are no matches in the beginning of the
sequence
Label column i1 to be all zero, and row j1 to
be all zero

0
0
0
0
0
0
0
0
18
Dynamic Programming Example
0
0
0
0
0
0
0
0
1
1
1
1
1
1
1
?0 1, if vi wj ? value from top ? value from
left
1
1
1
1
1
1
19
Alignment Backtracking

Arrows show where the score
originated from.
if from the top
if from the left
if vi wj

20
Backtracking Example
Find a match in row and column 2. i2, j2,5 is
a match (T). j2, i4,5,7 is
a match (T). Since vi wj, S(i,j) Si-1,j-1
1 S(2,2) S(1,1) 1 1 S(2,5) S(1,4)
1 1 S(4,2) S(3,1) 1 1 S(5,2) S(4,1)
1 1 S(7,2) S(6,1) 1 1
0
0
0
0
0
0
0
0
1
1
1
1
1
1
1
1
2
2
2
2
2
2
1
2
1
2
1
2
1
2
1
2
21
Backtracking Example
0
0
0
0
0
0
0
0
Continuing with the scoring algorithm gives this
result.
1
1
1
1
1
1
1
1
2
2
2
2
2
2
1
2
2
3
3
3
3
1
2
2
3
4
4
4
1
2
2
3
4
4
4
1
2
2
3
4
5
5
1
2
2
3
4
5
5
22
The LCS Problem

The previous example was a solution to the
Longest Common Subsequence (LCS) problemthe
simplest form of a sequence similarity analysis.
To solve the alignment we eliminate mismatches
and allow only insertions and deletions.

23
The LCS Problem (contd)

Find the longest subsequence common to two
strings.
Input Two strings, v and w.
Output The longest common subsequence of v
and w.

24
The LCS Recurrence

The score for vertex si,j is the same as in the
previous example

Si-1, j-11 if vi
wj Si,j max Si-1, j
Si, j-1
25
The LCS Recurrence Revisited

This can be rewritten by adding zero to the edges
that come from an indel, since the penalty of
indels are 0

Si-1, j-11 if vi
wj Si,j max Si-1, j 0
Si, j-1 0
26
LCS Algorithm

LCS(v,w)
for i ? 1 to n
Si,0 ? 0
for j ? 1 to m
S0,j ? 0
for i ? 1 to n
for j ? 1 to m
si-1,j
si,j ? max si,j-1
si-1,j-1 1, if vi wj
if si,j si-1,j
bi,j ? if si,j si,j-1
if si,j
si-1,j-1 1
return (sn,m, b)

27
Now What?

LCS(v,w) created the alignment grid
Now we need a way to read the best alignment of v
and w
Follow the arrows backwards from sink

28
Printing the LCS

PrintLCS(b,v,i,j)
if i 0 or j 0
return
if bi,j
PrintLCS(b,v,i-1,j-1)
print vi
else
if bi,j
PrintLCS(b,v,i-1,j)
else
PrintLCS(b,v,i,j-1)

29
LCS Runtime

To create the nxm matrix of best scores from
vertex (0,0) to all other vertices, it takes
O(nm) amount of time.
Why O(nm)? The pseudocode consists of a nested
for loop inside of another for loop to set up
a nxm matrix.
This sets up a value wj for every value vi.

30
Change up the Scoring

In the LCS Problem, we scored 1 from matches and
0 for indels in the alignment
Consider penalizing indels and mismatches with
negative scores

31
The Global Alignment Problem

Find the best alignment between two strings under
a given scoring matrix
Input Strings v w and a scoring matrix
Output Alignment of maximum score
?? -?
1 if matching
-µ if mismatching
si-1,j-1 1 if vi wj
si,j max s i-1,j-1 -µ if vi ? wj
s i-1,j - d
s i,j-1 - d

m mismatch d indel

32
Scoring Matrices

To generalize scoring, consider a (41) x(41)
scoring matrix d.
In the case of an amino acid alignment, the
scoring matrix would be a (201)x(201) size.
The addition of 1 is to include the score with
comparison of a gap character -.
This will simplify the scoring algorithm as
follows
si-1,j-1 d (vi, wj)
si,j max s i-1,j d (vi, -)
s i,j-1 d (-, wj)

33
Simple Scoring

When mismatches are penalized by some constant
µ, indels are penalized by some other constant
s, and matches are rewarded with 1, the
resulting score is
matches µ(mismatches) s (indels)

34
Making a Scoring Matrix

Scoring matrices are created based on biological
evidence.
Alignments can be thought of as two sequences
that differ due to mutations in the sequence.
Some of these mutations have little effect on the
organisms function, therefore some penalties,
d(vi , wj), will be less harsh than others.

35
Scoring Matrix Example

Notice that although R and K are different amino
acids, they have a positive score.
Why? They are both positively charged amino
acids? will not greatly change function of
protein.

36
The Blosum50 Scoring Matrix
37
Local vs. Global Alignment

The Global Alignment Problem tries to find the
longest path between vertices (0,0) and (n,m) in
the edit graph.
The Local Alignment Problem tries to find the
longest path among paths between arbitrary
vertices (i,j) and (i, j) in the edit graph.

38
Local vs. Global Alignment (contd)

Global Alignment
Local Alignmentbetter alignment to find
conserved segment

--T-CC-C-AGT-TATGT-CAGGGGACACGA-GCATGCAGA-G
AC

AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATGT-CAGAT-
-C
tccCAGTTATGTCAGgggacacgagcatgcagag
ac

aattgccgccgtcgttttcagCAGTTATGTCAGatc
39
Local Alignment Example
Local alignment
Global alignment
40
Local Alignments Why?

Some genes only have small conserved regions
between species of organisms
Example
Homeobox genes have a short region called the
homeodomain that is highly conserved between
species.
A global alignment would not find the homeodomain
because it would try to align the ENTIRE sequence

41
The Local Alignment Problem

Goal To find the best local alignment of strings
v and w
Input Strings v, w and scoring matrix d
Output alignment of substrings of v w whose
alignment score is maximum among all possible
alignment of all possible substrings

42
The Problem with this Problem

Problem of this, long run time O(n4)
- There are n2 pairs of vertices (i,j)
- For each pair of vertices computing an
alignment takes O(n2) time.
This can be remedied by giving free rides

43
Local Alignment Free Rides
Yeah, a free ride!
Vertex (0,0)
The dashed edges represent the free rides from
(0,0) to every other node.
44
The Local Alignment Recurrence

The largest value of si,j over the whole edit
graph is the score of the best local alignment.
The recurrence is shown below

0 si,j si-1,j-1 d
(vi, wj) max s i-1,j d (vi, -)
s i,j-1 d (-, wj)

45
Affine Gap Penalties

In nature, many times indels come as a unit, not
just at 1 nucleotide at a time.

ATA__GC ATATTGC
ATAG_GC AT_GTGC
Normal scoring would give the same score for both
alignments
46
Accounting for Gaps

Gaps- contiguous sequence of spaces in one of the
rows
Score for a gap of length x is -(? sx), where
? gt0 is the penalty for introducing a gap. ? will
be large relative to s because you do not want to
add too much of a penalty for extending the gap.

47
Affine Gap Penalties and 3 Layer Manhattan Grid

The three recurrences for the scoring algorithm
creates a 3-tiered graph.
The top level creates/extends gaps in the
sequence w.
The bottom level creates/extends gaps in sequence
v.
The middle level extends matches and mismatches.

48
The 3 Grids
49
The 3-leveled Manhattan Grid
Gaps in w
Matches/Mismatches
Gaps in v
50
Affine Gap Penalty Recurrences
Continue Gap in w (deletion)
si,j s i-1,j - s max s
i-1,j (?s) si,j s i,j-1 - s
max s i,j-1 (?s) si,j
si-1,j-1 d (vi, wj) max s i,j
s i,j
Start Gap in w (deletion) from middle
Continue Gap in v (insertion)
Start Gap in v (insertion)from middle
Match or Mismatch
End deletion from top
End insertion from bottom

Write a Comment

User Comments (0)