Sequence Alignment - PowerPoint PPT Presentation

Slides: 51
Provided by: mch121
Transcript and Presenter's Notes

1
Sequence Alignment
2
Outline
  • Applying Manhattan Tourist Problem to sequence
    comparison
  • Global Alignment
  • Scoring Matrices
  • Local Alignment
  • Alignment with Affine Gap Penalties

3
Align Two Strings
  • Given the strings of DNA:
  • v = ATGTTAT
  • w = ATCGTAC
  • One possible alignment of the strings:
  • AT_GTTAT_
  • ATCGT_A_C

4
Align Two Strings (contd)
Each of the two rows of the alignment is
represented by a string of letters interspersed
with space symbols, written here as _
  • AT_GTTAT_
  • ATCGT_A_C

5
Align Two Strings (contd)
Another way to represent each row shows the
number of symbols of the sequence present up to a
given position. For example, the above sequences
can be represented as

0 1 2 2 3 4 5 6 7 7
 A T _ G T T A T _
 A T C G T _ A _ C
0 1 2 3 4 5 5 6 6 7
6
Alignment Matrix
Both rows of the alignment can be represented in
the resulting matrix
0 1 2 2 3 4 5 6 7 7
0 1 2 3 4 5 5 6 6 7
Each column of this matrix is a coordinate (i, j) in a
two-dimensional n × m grid
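The mapping from an alignment to grid coordinates can be sketched in a few lines of Python. This is only an illustration: the function name `alignment_to_path` and the use of `_` as the gap symbol are assumptions made here, not part of the slides.

```python
def alignment_to_path(row_v, row_w, gap="_"):
    """Return the (i, j) grid coordinates visited by an alignment.

    i counts the symbols of v consumed so far, j the symbols of w.
    """
    if len(row_v) != len(row_w):
        raise ValueError("alignment rows must have equal length")
    i = j = 0
    path = [(0, 0)]
    for a, b in zip(row_v, row_w):
        if a != gap:
            i += 1  # this column consumes a symbol of v
        if b != gap:
            j += 1  # this column consumes a symbol of w
        path.append((i, j))
    return path


print(alignment_to_path("AT_GTTAT_", "ATCGT_A_C"))
```

Each column of the alignment contributes one coordinate, so an alignment with 9 columns yields a path of 10 vertices that starts at (0,0) and ends at (7,7).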
7
Alignment as a Path in the Edit Graph
0 1 2 2 3 4 5 6 7 7
 A T _ G T T A T _
 A T C G T _ A _ C
0 1 2 3 4 5 5 6 6 7
Path so far: (0,0), (1,1)
8
Alignment as a Path in the Edit Graph
[Figure: edit graph with both axes labeled 0 to 7]
0 1 2 2 3 4 5 6 7 7
 A T _ G T T A T _
 A T C G T _ A _ C
0 1 2 3 4 5 5 6 6 7
Path so far: (0,0), (1,1), (2,2)
9
Alignment as a Path in the Edit Graph
[Figure: edit graph with both axes labeled 0 to 7]
0 1 2 2 3 4 5 6 7 7
 A T _ G T T A T _
 A T C G T _ A _ C
0 1 2 3 4 5 5 6 6 7
Path so far: (0,0), (1,1), (2,2), (2,3), (3,4)
10
Alignment as a Path in the Edit Graph
0 1 2 2 3 4 5 6 7 7
 A T _ G T T A T _
 A T C G T _ A _ C
0 1 2 3 4 5 5 6 6 7
Complete path: (0,0), (1,1), (2,2), (2,3), (3,4),
(4,5), (5,5), (6,6), (7,6), (7,7)
- End Result -
11
Alignments in Edit Graph (contd)
  • Horizontal and vertical edges represent indels in v and w
  • Score 0.
  • Diagonal edges represent exact matches.
  • Score 1.

12
Alignments in Edit Graph (contd)
The score of the alignment path in the graph is
5.
13
Alignment as a Path in the Edit Graph
Every path in the edit graph corresponds to an
alignment
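The correspondence also works in the other direction: a path can be turned back into an alignment by looking at each step. The sketch below assumes the convention used in the coordinate lists of this deck (the first coordinate counts symbols of v, the second counts symbols of w); the name `path_to_alignment` and the `_` gap symbol are illustrative choices.

```python
def path_to_alignment(path, v, w, gap="_"):
    """Rebuild the two alignment rows from a list of (i, j) vertices."""
    row_v, row_w = [], []
    for (i0, j0), (i1, j1) in zip(path, path[1:]):
        step = (i1 - i0, j1 - j0)
        if step == (1, 1):      # diagonal edge: align v[i0] with w[j0]
            row_v.append(v[i0])
            row_w.append(w[j0])
        elif step == (1, 0):    # consume a symbol of v only: gap in w
            row_v.append(v[i0])
            row_w.append(gap)
        elif step == (0, 1):    # consume a symbol of w only: gap in v
            row_v.append(gap)
            row_w.append(w[j0])
        else:
            raise ValueError(f"not an edit-graph edge: {step}")
    return "".join(row_v), "".join(row_w)


path = [(0, 0), (1, 1), (2, 2), (2, 3), (3, 4),
        (4, 5), (5, 5), (6, 6), (7, 6), (7, 7)]
print(path_to_alignment(path, "ATGTTAT", "ATCGTAC"))
```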
14
Alignment as a Path in the Edit Graph
Old Alignment
   0 1 2 2 3 4 5 6 7 7
v   A T _ G T T A T _
w   A T C G T _ A _ C
   0 1 2 3 4 5 5 6 6 7
New Alignment
   0 1 2 2 3 4 5 6 7 7
v   A T _ G T T A T _
w   A T C G _ T A _ C
   0 1 2 3 4 4 5 6 6 7
15
Alignment as a Path in the Edit Graph
   0 1 2 2 3 4 5 6 7 7
v   A T _ G T T A T _
w   A T C G T _ A _ C
   0 1 2 3 4 5 5 6 6 7
(0,0), (1,1), (2,2), (2,3), (3,4), (4,5), (5,5),
(6,6), (7,6), (7,7)
16
Alignment Dynamic Programming
17
Dynamic Programming Example
  • There are no matches in the beginning of the
    sequence
  • Label column i = 1 to be all zero, and row j = 1 to
    be all zero

[Figure: dynamic programming grid with its first row and first column filled with zeros]
18
Dynamic Programming Example
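[Figure: dynamic programming grid with the zero row and column, partly filled with 1s]
\[
S(i,j) = \max \begin{cases}
S(i-1,j-1) + 1, & \text{if } v_i = w_j \\
S(i-1,j) & \text{(value from top)} \\
S(i,j-1) & \text{(value from left)}
\end{cases}
\]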
19
Alignment Backtracking
  • Arrows show where the score
    originated from.
  • ↑ if from the top
  • ← if from the left
  • ↖ if v_i = w_j

20
Backtracking Example
Find a match in row and column 2.
i = 2, j = 2, 5 is a match (T).
j = 2, i = 4, 5, 7 is a match (T).
Since v_i = w_j, S(i,j) = S(i-1,j-1) + 1:
S(2,2) = S(1,1) + 1 = 1
S(2,5) = S(1,4) + 1 = 1
S(4,2) = S(3,1) + 1 = 1
S(5,2) = S(4,1) + 1 = 1
S(7,2) = S(6,1) + 1 = 1
[Figure: dynamic programming grid after filling row 2 and column 2]
21
Backtracking Example
Continuing with the scoring algorithm gives this
result.
[Figure: the completed dynamic programming grid, with entries ranging from 0 to 5]
22
The LCS Problem
  • The previous example was a solution to the
    Longest Common Subsequence (LCS) problem, the
    simplest form of sequence similarity analysis.
  • To solve the alignment, we eliminate mismatches
    and allow only insertions and deletions.

23
The LCS Problem (contd)
  • Find the longest subsequence common to two
    strings.
  • Input: Two strings, v and w.
  • Output: The longest common subsequence of v
    and w.

24
The LCS Recurrence
  • The score for vertex s_{i,j} is the same as in the
    previous example:

\[
s_{i,j} = \max \begin{cases}
s_{i-1,j-1} + 1, & \text{if } v_i = w_j \\
s_{i-1,j} \\
s_{i,j-1}
\end{cases}
\]
25
The LCS Recurrence Revisited
  • This can be rewritten by adding zero to the edges
    that come from an indel, since the penalty for
    indels is 0:

\[
s_{i,j} = \max \begin{cases}
s_{i-1,j-1} + 1, & \text{if } v_i = w_j \\
s_{i-1,j} + 0 \\
s_{i,j-1} + 0
\end{cases}
\]
26
LCS Algorithm
  • LCS(v,w)
  •   for i ← 0 to n
  •     s_{i,0} ← 0
  •   for j ← 1 to m
  •     s_{0,j} ← 0
  •   for i ← 1 to n
  •     for j ← 1 to m
  •       s_{i,j} ← max of:
  •           s_{i-1,j}
  •           s_{i,j-1}
  •           s_{i-1,j-1} + 1, if v_i = w_j
  •       b_{i,j} ← "↑" if s_{i,j} = s_{i-1,j}
  •                 "←" if s_{i,j} = s_{i,j-1}
  •                 "↖" if s_{i,j} = s_{i-1,j-1} + 1
  •   return (s_{n,m}, b)
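For readers who want to run the pseudocode, here is a minimal Python sketch of the same table-filling procedure. Strings are 0-indexed in Python, so v_i becomes `v[i - 1]`; the arrow labels "up", "left" and "diag" are naming choices made for this example.

```python
def lcs(v, w):
    """Fill the LCS score table s and the backtracking table b."""
    n, m = len(v), len(w)
    # Row 0 and column 0 stay at zero, as in the initialization loops.
    s = [[0] * (m + 1) for _ in range(n + 1)]
    b = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best, arrow = s[i - 1][j], "up"
            if s[i][j - 1] > best:
                best, arrow = s[i][j - 1], "left"
            if v[i - 1] == w[j - 1] and s[i - 1][j - 1] + 1 > best:
                best, arrow = s[i - 1][j - 1] + 1, "diag"
            s[i][j], b[i][j] = best, arrow
    return s[n][m], b


score, arrows = lcs("ATGTTAT", "ATCGTAC")
print(score)
```

For the two example strings from the start of the deck the returned score is 5, matching the five matched columns of the alignment shown there.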



27
Now What?
  • LCS(v,w) created the alignment grid
  • Now we need a way to read the best alignment of v
    and w
  • Follow the arrows backwards from sink

28
Printing the LCS
  • PrintLCS(b,v,i,j)
  •   if i = 0 or j = 0
  •     return
  •   if b_{i,j} = "↖"
  •     PrintLCS(b,v,i-1,j-1)
  •     print v_i
  •   else
  •     if b_{i,j} = "↑"
  •       PrintLCS(b,v,i-1,j)
  •     else
  •       PrintLCS(b,v,i,j-1)
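A runnable counterpart to PrintLCS might look like the sketch below. It walks the arrows iteratively instead of recursively (to avoid Python's recursion limit on long strings) and returns the subsequence rather than printing it. The helper `lcs_backpointers` is a compact table builder included only so the example is self-contained; both function names and the arrow labels are assumptions of this example.

```python
def lcs_backpointers(v, w):
    """Build the arrow table b for the LCS recurrence."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    b = [[""] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # -1 marks "no diagonal edge available" when symbols differ.
            diag = s[i - 1][j - 1] + 1 if v[i - 1] == w[j - 1] else -1
            s[i][j] = max(s[i - 1][j], s[i][j - 1], diag)
            if s[i][j] == diag:
                b[i][j] = "diag"
            elif s[i][j] == s[i - 1][j]:
                b[i][j] = "up"
            else:
                b[i][j] = "left"
    return b


def read_lcs(b, v, i, j):
    """Follow the arrows backwards from (i, j) and collect matched symbols."""
    out = []
    while i > 0 and j > 0:
        if b[i][j] == "diag":
            out.append(v[i - 1])
            i, j = i - 1, j - 1
        elif b[i][j] == "up":
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


v, w = "ATGTTAT", "ATCGTAC"
print(read_lcs(lcs_backpointers(v, w), v, len(v), len(w)))
```

The symbols are collected in reverse order during the walk from the sink, so the list is reversed once at the end; the recursive pseudocode achieves the same ordering by printing after the recursive call returns.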

29
LCS Runtime
  • To create the n × m matrix of best scores from
    vertex (0,0) to all other vertices, it takes
    O(nm) time.
  • Why O(nm)? The pseudocode consists of a for
    loop nested inside another for loop to fill
    an n × m matrix.
  • Each pair (v_i, w_j) is visited once, and each
    visit takes constant time.

30
Change up the Scoring
  • In the LCS Problem, we scored 1 for matches and
    0 for indels in the alignment
  • Consider penalizing indels and mismatches with
    negative scores

31
The Global Alignment Problem
  • Find the best alignment between two strings under
    a given scoring matrix
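  • Input: Strings v and w and a scoring matrix
  • Output: Alignment of maximum score
  • Indel edges (↑, ←): -σ
  • Diagonal edges (↖): +1 if matching
  • -µ if mismatching

\[
s_{i,j} = \max \begin{cases}
s_{i-1,j-1} + 1, & \text{if } v_i = w_j \\
s_{i-1,j-1} - \mu, & \text{if } v_i \neq w_j \\
s_{i-1,j} - \sigma \\
s_{i,j-1} - \sigma
\end{cases}
\]

µ: mismatch penalty; σ: indel penalty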
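A small Python sketch of this recurrence follows. The slide does not say how the first row and column are initialized; the sketch assumes the usual choice of charging one indel penalty per leading gap symbol. The parameter names `mu` and `sigma` stand for the mismatch and indel penalties and are passed as positive numbers.

```python
def global_alignment_score(v, w, mu=1, sigma=1):
    """Score of the best global alignment: +1 match, -mu mismatch, -sigma indel."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    # Assumed boundary: a prefix aligned against nothing costs sigma per symbol.
    for i in range(1, n + 1):
        s[i][0] = -sigma * i
    for j in range(1, m + 1):
        s[0][j] = -sigma * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = s[i - 1][j - 1] + (1 if v[i - 1] == w[j - 1] else -mu)
            s[i][j] = max(diag, s[i - 1][j] - sigma, s[i][j - 1] - sigma)
    return s[n][m]


print(global_alignment_score("ATGTTAT", "ATCGTAC", mu=0, sigma=0))  # 5
```

With both penalties set to zero the score reduces to the number of matches in the best alignment, which is why the call above returns the LCS length of the two example strings.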

32
Scoring Matrices
  • To generalize scoring, consider a (4+1) × (4+1)
    scoring matrix δ.
  • In the case of an amino acid alignment, the
    scoring matrix would be of size (20+1) × (20+1).
    The addition of 1 is to include the score for
    comparison with a gap character "-".
  • This simplifies the scoring recurrence as
    follows:

\[
s_{i,j} = \max \begin{cases}
s_{i-1,j-1} + \delta(v_i, w_j) \\
s_{i-1,j} + \delta(v_i, -) \\
s_{i,j-1} + \delta(-, w_j)
\end{cases}
\]
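One way to picture the scoring matrix is as a lookup table keyed by pairs of symbols, with the gap character treated as an extra symbol. The sketch below builds such a table for DNA from three illustrative constants and plugs it into the recurrence. The function names, the constants, and the decision to leave out the (gap, gap) entry (which the recurrence never looks up) are assumptions of this example, not values from the slides.

```python
def make_scoring_matrix(alphabet, match, mismatch, indel, gap="-"):
    """Build delta as a dict keyed by (symbol, symbol), gap symbol included."""
    delta = {}
    symbols = alphabet + gap
    for a in symbols:
        for b in symbols:
            if a == gap and b == gap:
                continue  # a gap is never aligned with a gap
            if gap in (a, b):
                delta[a, b] = indel
            else:
                delta[a, b] = match if a == b else mismatch
    return delta


def global_score(v, w, delta, gap="-"):
    """Global alignment score where every edge weight is read from delta."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = s[i - 1][0] + delta[v[i - 1], gap]
    for j in range(1, m + 1):
        s[0][j] = s[0][j - 1] + delta[gap, w[j - 1]]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(
                s[i - 1][j - 1] + delta[v[i - 1], w[j - 1]],
                s[i - 1][j] + delta[v[i - 1], gap],
                s[i][j - 1] + delta[gap, w[j - 1]],
            )
    return s[n][m]


delta = make_scoring_matrix("ACGT", match=1, mismatch=-1, indel=-2)
print(len(delta))  # 24: the 5 x 5 table minus the unused (gap, gap) cell
print(global_score("ATGTTAT", "ATCGTAC", delta))
```

Because the recurrence only ever reads from `delta`, swapping in a biologically derived matrix would require no change to `global_score`.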


33
Simple Scoring
  • When mismatches are penalized by some constant
    µ, indels are penalized by some other constant
    σ, and matches are rewarded with 1, the
    resulting score is
  • #matches - µ(#mismatches) - σ(#indels)
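The formula can be checked directly on a finished alignment by counting the three kinds of columns. In this sketch `mu` and `sigma` are the mismatch and indel penalties from the slide, while the function name and the `_` gap symbol are choices made for the example.

```python
def simple_score(row_v, row_w, mu, sigma, gap="_"):
    """Score an alignment as matches - mu * mismatches - sigma * indels."""
    matches = mismatches = indels = 0
    for a, b in zip(row_v, row_w):
        if a == gap or b == gap:
            indels += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches - mu * mismatches - sigma * indels


# The alignment from the start of the deck: 5 matches, 0 mismatches, 4 indels.
print(simple_score("AT_GTTAT_", "ATCGT_A_C", mu=1, sigma=2))  # -3
```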

34
Making a Scoring Matrix
  • Scoring matrices are created based on biological
    evidence.
  • Alignments can be thought of as two sequences
    that differ due to mutations in the sequence.
  • Some of these mutations have little effect on the
    organism's function; therefore some penalties,
    δ(v_i, w_j), will be less harsh than others.

35
Scoring Matrix Example
  • Notice that although R and K are different amino
    acids, they have a positive score.
  • Why? They are both positively charged amino
    acids, so substituting one for the other will not
    greatly change the function of the protein.

36
The Blosum50 Scoring Matrix
37
Local vs. Global Alignment
  • The Global Alignment Problem tries to find the
    longest path between vertices (0,0) and (n,m) in
    the edit graph.
  • The Local Alignment Problem tries to find the
    longest path among paths between arbitrary
    vertices (i,j) and (i', j') in the edit graph.

38
Local vs. Global Alignment (contd)
  • Global Alignment

--T-CC-C-AGT-TATGT-CAGGGGACACGA-GCATGCAGA-GAC
AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATGT-CAGAT--C

  • Local Alignment: better alignment to find
    conserved segment

                  tccCAGTTATGTCAGgggacacgagcatgcagagac
aattgccgccgtcgttttcagCAGTTATGTCAGatc
39
Local Alignment Example
Local alignment
Global alignment
40
Local Alignments: Why?
  • Some genes only have small conserved regions
    between species of organisms
  • Example
  • Homeobox genes have a short region called the
    homeodomain that is highly conserved between
    species.
  • A global alignment would not find the homeodomain
    because it would try to align the ENTIRE sequence

41
The Local Alignment Problem
  • Goal: To find the best local alignment of strings
    v and w
  • Input: Strings v, w and scoring matrix δ
  • Output: Alignment of substrings of v and w whose
    alignment score is maximum among all possible
    alignments of all possible substrings

42
The Problem with this Problem
  • The problem: a long running time, O(n^4)
  • - There are n^2 vertices (i,j), each a possible
    starting point
  • - For each such vertex, computing an
    alignment takes O(n^2) time.
  • This can be remedied by giving free rides

43
Local Alignment Free Rides
Yeah, a free ride!
Vertex (0,0)
The dashed edges represent the free rides from
(0,0) to every other node.
44
The Local Alignment Recurrence
  • The largest value of si,j over the whole edit
    graph is the score of the best local alignment.
  • The recurrence is shown below

\[
s_{i,j} = \max \begin{cases}
0 \\
s_{i-1,j-1} + \delta(v_i, w_j) \\
s_{i-1,j} + \delta(v_i, -) \\
s_{i,j-1} + \delta(-, w_j)
\end{cases}
\]
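The local alignment recurrence differs from the global one only by the extra 0 option (the free ride from the source) and by taking the maximum over the whole table. A minimal Python sketch follows, in which the scoring function `delta` is passed in as a callable; the function name, the `-` gap symbol and the example scoring values (+1, -1, -1) are assumptions made for illustration.

```python
def local_alignment_score(v, w, delta, gap="-"):
    """Best local alignment score: the largest entry of the whole table."""
    n, m = len(v), len(w)
    # Row 0 and column 0 stay at 0: every vertex is reachable for free.
    s = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(
                0,  # free ride from (0,0)
                s[i - 1][j - 1] + delta(v[i - 1], w[j - 1]),
                s[i - 1][j] + delta(v[i - 1], gap),
                s[i][j - 1] + delta(gap, w[j - 1]),
            )
            best = max(best, s[i][j])
    return best


def unit_delta(a, b):
    """Illustrative scoring: +1 match, -1 mismatch, -1 indel."""
    if a == "-" or b == "-":
        return -1
    return 1 if a == b else -1


print(local_alignment_score("XXACGTXX", "YYACGTYY", unit_delta))  # 4
```

The running time stays O(nm), because the free rides are folded into the recurrence instead of restarting the computation from every possible source vertex.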

45
Affine Gap Penalties
  • In nature, indels often come as a unit, not
    just one nucleotide at a time.

ATA__GC      ATAG_GC
ATATTGC      AT_GTGC
Normal scoring would give the same score for both
alignments
46
Accounting for Gaps
  • Gaps: contiguous sequences of spaces in one of the
    rows
  • The score for a gap of length x is -(ρ + σx), where
    ρ > 0 is the penalty for introducing a gap and σ
    is the penalty for each position of the gap. ρ will
    be large relative to σ because you do not want to
    add too much of a penalty for extending the gap.
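To see the effect of an affine gap score on a concrete case, the sketch below scores a finished alignment by charging an opening penalty once per run of gap symbols plus an extension penalty per gap position. The parameter names `rho` (gap opening) and `sigma` (gap extension), the function name, and the `_` gap symbol are assumptions of this example; it also assumes no column has a gap in both rows.

```python
from itertools import groupby


def affine_score(row_v, row_w, rho, sigma, mu=1, gap="_"):
    """Score an alignment with +1 match, -mu mismatch, -(rho + sigma*x) per gap."""
    score = 0
    for a, b in zip(row_v, row_w):
        if a != gap and b != gap:
            score += 1 if a == b else -mu
    # Each maximal run of gap symbols in a row is one gap of length x.
    for row in (row_v, row_w):
        for symbol, run in groupby(row):
            if symbol == gap:
                score -= rho + sigma * len(list(run))
    return score


# One gap of length 2 versus two gaps of length 1, each with 5 matches.
print(affine_score("ATA__GC", "ATATTGC", rho=3, sigma=1))  # 0
print(affine_score("ATAG_GC", "AT_GTGC", rho=3, sigma=1))  # -3
```

With `rho=0` both alignments would score the same, which is the behavior of per-symbol indel scoring; the opening penalty is what makes the single longer gap preferable.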

47
Affine Gap Penalties and 3 Layer Manhattan Grid
  • The three recurrences for the scoring algorithm
    create a 3-tiered graph.
  • The top level creates/extends gaps in the
    sequence w.
  • The bottom level creates/extends gaps in sequence
    v.
  • The middle level extends matches and mismatches.

48
The 3 Grids
49
The 3-leveled Manhattan Grid
Gaps in w
Matches/Mismatches
Gaps in v
50
Affine Gap Penalty Recurrences
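Top level (gaps in w):
\[
s^{\downarrow}_{i,j} = \max \begin{cases}
s^{\downarrow}_{i-1,j} - \sigma & \text{continue gap in } w \text{ (deletion)} \\
s_{i-1,j} - (\rho + \sigma) & \text{start gap in } w \text{ (deletion), from middle}
\end{cases}
\]
Bottom level (gaps in v):
\[
s^{\rightarrow}_{i,j} = \max \begin{cases}
s^{\rightarrow}_{i,j-1} - \sigma & \text{continue gap in } v \text{ (insertion)} \\
s_{i,j-1} - (\rho + \sigma) & \text{start gap in } v \text{ (insertion), from middle}
\end{cases}
\]
Middle level (matches/mismatches):
\[
s_{i,j} = \max \begin{cases}
s_{i-1,j-1} + \delta(v_i, w_j) & \text{match or mismatch} \\
s^{\downarrow}_{i,j} & \text{end deletion, from top} \\
s^{\rightarrow}_{i,j} & \text{end insertion, from bottom}
\end{cases}
\]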
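The three levels can be implemented as three tables filled in lockstep. The Python sketch below is one possible reading of the recurrences, with `rho` as the gap opening penalty, `sigma` as the gap extension penalty, and `delta` as a callable scoring matched or mismatched symbols. The table names and the boundary values (a leading gap of length k costs rho + sigma*k) are assumptions of this example, since the slides do not state them.

```python
NEG_INF = float("-inf")


def affine_global_score(v, w, delta, rho, sigma):
    """Global alignment score with gap cost rho + sigma * length."""
    n, m = len(v), len(w)
    mid = [[NEG_INF] * (m + 1) for _ in range(n + 1)]    # matches/mismatches
    down = [[NEG_INF] * (m + 1) for _ in range(n + 1)]   # gaps in w
    right = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # gaps in v
    mid[0][0] = 0
    for i in range(1, n + 1):  # assumed boundary: one gap of length i
        down[i][0] = -(rho + sigma * i)
        mid[i][0] = down[i][0]
    for j in range(1, m + 1):  # assumed boundary: one gap of length j
        right[0][j] = -(rho + sigma * j)
        mid[0][j] = right[0][j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            down[i][j] = max(down[i - 1][j] - sigma,          # continue gap in w
                             mid[i - 1][j] - (rho + sigma))   # start gap in w
            right[i][j] = max(right[i][j - 1] - sigma,        # continue gap in v
                              mid[i][j - 1] - (rho + sigma))  # start gap in v
            mid[i][j] = max(mid[i - 1][j - 1] + delta(v[i - 1], w[j - 1]),
                            down[i][j],                       # end deletion
                            right[i][j])                      # end insertion
    return mid[n][m]


def match_delta(a, b):
    """Illustrative scoring: +1 match, -1 mismatch."""
    return 1 if a == b else -1


print(affine_global_score("ATATTGC", "ATAGC", match_delta, rho=3, sigma=1))  # 0
```

In the example call, the best alignment keeps all five symbols of the shorter string matched and pays for a single gap of length 2, giving 5 - (3 + 2) = 0.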