CS 5263 Bioinformatics - PowerPoint PPT Presentation

1 / 197
About This Presentation
Title:

CS 5263 Bioinformatics

Description:

Lectures 3-6: Pair-wise Sequence Alignment ... – PowerPoint PPT presentation

Number of Views:133
Avg rating:3.0/5.0
Slides: 198
Provided by: Jian144
Learn more at: http://www.cs.utsa.edu
Category:

less

Transcript and Presenter's Notes

Title: CS 5263 Bioinformatics


1
CS 5263 Bioinformatics
  • Lectures 3-6 Pair-wise Sequence Alignment

2
Outline
  • Part I Algorithms
  • Biological problem
  • Intro to dynamic programming
  • Global sequence alignment
  • Local sequence alignment
  • More efficient algorithms
  • Part II Biological issues
  • Model gaps more accurately
  • Alignment statistics
  • Part III BLAST

3
Evolution at the DNA level
C
ACGGTGCAGTCACCA
ACGTTGC-GTCCACCA
DNA evolutionary events (sequence
edits) Mutation, deletion, insertion
4
Sequence conservation implies function



next generation
OK



OK



OK



X



X



Still OK?



5
Why sequence alignment?
  • Conserved regions are more likely to be
    functional
  • Can be used for finding genes, regulatory
    elements, etc.
  • Similar sequences often have similar origin and
    function
  • Can be used to predict functions for new genes /
    proteins
  • Sequence alignment is one of the most widely used
    computational tools in biology

6
Global Sequence Alignment
AGGCTATCACCTGACCTCCAGGCCGATGCCC TAGCTATCACGACCGCGG
TCGATTTGCCCGAC
S
T
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC-
-GACCGC--GGTCGATTTGCCCGAC
S
T
  • Definition
  • An alignment of two strings S, T is a pair of
    strings S, T (with spaces) s.t.
  • S T, and (S length of S)
  • removing all spaces in S, T leaves S, T

7
What is a good alignment?
  • Alignment
  • The best way to match the letters of one
    sequence with those of the other
  • How do we define best?

8
S -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- T
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
  • The score of aligning (characters or spaces) x
    y is s (x,y).
  • Score of an alignment
  • An optimal alignment one with max score

9
Scoring Function
  • Sequence edits
  • AGGCCTC
  • Mutations AGGACTC
  • Insertions AGGGCCTC
  • Deletions AGG-CTC
  • Scoring Function
  • Match m AAC
  • Mismatch -s A-A
  • Gap (indel) -d

10
  • Match 2, mismatch -1, gap -1
  • Score 3 x 2 2 x 1 1 x 1 3

11
More complex scoring function
  • Substitution matrix
  • Similarity score of matching two letters a, b
    should reflect the probability of a, b derived
    from the same ancestor
  • It is usually defined by log likelihood ratio
  • Active research area. Especially for proteins.
  • Commonly used PAM, BLOSUM

12
An example substitution matrix
A C G T
A 3 -2 -1 -2
C 3 -2 -1
G 3 -2
T 3
13
How to find an optimal alignment?
  • A naïve algorithm
  • for all subseqs A of S, B of T s.t. A B do
  • align Ai with Bi, 1 i A
  • align all other chars to spaces
  • compute its value
  • retain the max
  • end
  • output the retained alignment

S abcd A cd T wxyz B xz -abc-d
a-bc-d w--xyz -w-xyz
14
Analysis
  • Assume S T n
  • Cost of evaluating one alignment n
  • How many alignments are there
  • pick n chars of S,T together
  • say k of them are in S
  • match these k to the k unpicked chars of T
  • Total time
  • E.g., for n 20, time is gt 240 gt1012 operations

15
  • Intro to Dynamic Programming

16
Dynamic programming
  • What is dynamic programming?
  • A method for solving problems exhibiting the
    properties of overlapping subproblems and optimal
    substructure
  • Key idea tabulating sub-problem solutions rather
    than re-computing them repeatedly
  • Two simple examples
  • Computing Fibonacci numbers
  • Find the special shortest path in a grid

17
Example 1 Fibonacci numbers
  • 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89,
  • F(0) 1
  • F(1) 1
  • F(n) F(n-1) f(n-2)
  • How to compute F(n)?

18
A recursive algorithm
  • function fib(n)
  • if (n 0 or n 1) return 1
  • else return fib(n-1) fib(n-2)

19
n/2
n
  • Time complexity
  • Between 2n/2 and 2n
  • O(1.62n), i.e. exponential
  • Why recursive Fib algorithm is inefficient?
  • Overlapping subproblems

20
An iterative algorithm
  • function fib(n)
  • F0 1 F1 1
  • for i 2 to n
  • Fi Fi-1 Fi-2
  • Return Fn

Time complexity Time O(n), space O(n)
21
Example 2 shortest path in a grid
S










m
G
n
Each edge has a length (cost). We need to get to
G from S. Can only move right or down. Aim find
a path with the minimum total length
22
Optimal substructures
  • Naïve algorithm enumerate all possible paths and
    compare costs
  • Exponential number of paths
  • Key observation
  • If a path P(S, G) is the shortest from S to G,
    any of its sub-path P(S,x), where x is on P(S,G),
    is the shortest from S to x

23
Proof
  • Proof by contradiction
  • If the path between P(S,x) is not the shortest,
    i.e., P(S,x) lt P(S,x)
  • Construct a new path P(S,G) P(S,x) P(x, G)
  • P(S,G) lt P(S,G) gt P(S,G) is not the shortest
  • Contradiction
  • Therefore, P(S, x) is the shortest

S










x
G
24
Recursive solution
(0,0)
  • Index each intersection by two indices, (i, j)
  • Let F(i, j) be the total length of the shortest
    path from (0, 0) to (i, j). Therefore, F(m, n) is
    the shortest path we wanted.
  • To compute F(m, n), we need to compute both
    F(m-1, n) and F(m, n-1)











m
(m, n)
n
F(m-1, n) length((m-1, n), (m, n)) F(m,
n) min F(m, n-1) length((m,
n-1), (m, n))
25
Recursive solution
F(i-1, j) length((i-1, j), (i, j)) F(i, j)
min F(i, j-1) length((i, j-1), (i, j))
(0,0)
  • But if we use recursive call, many subpaths will
    be recomputed for many times
  • Strategy pre-compute F values starting from the
    upper-left corner. Fill in row by row (what other
    order will also do?)











(i-1, j)
(i, j)
(i, j-1)
m
(m, n)
n
26
Dynamic programming illustration
S
9
1
2
3
3
12
13
15
0




5
3
3
3
3
2
5
2
3
6
8
13
15
5
2
3
3
9
3
4
2
3
2
9
7
11
13
16
6
2
3
7
4
6
3
3
3
11
14
17
20
13
4
6
3
1
3
2
3
2
1
17
17
18
20
17
G
F(i-1, j) length(i-1, j, i, j) F(i, j)
min F(i, j-1) length(i, j-1, i, j)
27
Trackback
9
1
2
3
3
12
13
15
0




5
3
3
3
3
2
5
2
3
6
8
13
15
5
2
3
3
9
3
4
2
3
2
9
7
11
13
16
6
2
3
7
4
6
3
3
3
11
14
17
20
13
4
6
3
1
3
2
3
2
1
17
17
18
20
17
28
Elements of dynamic programming
  • Optimal sub-structures
  • Optimal solutions to the original problem
    contains optimal solutions to sub-problems
  • Overlapping sub-problems
  • Some sub-problems appear in many solutions
  • Memorization and reuse
  • Carefully choose the order that sub-problems are
    solved

29
Dynamic Programming for sequence alignment
  • Suppose we wish to align
  • x1xM
  • y1yN
  • Let F(i,j) optimal score of aligning
  • x1xi
  • y1yj
  • Scoring Function
  • Match m
  • Mismatch -s
  • Gap (indel) -d

30
Elements of dynamic programming
  • Optimal sub-structures
  • Optimal solutions to the original problem
    contains optimal solutions to sub-problems
  • Overlapping sub-problems
  • Some sub-problems appear in many solutions
  • Memorization and reuse
  • Carefully choose the order that sub-problems are
    solved

31
Optimal substructure
  • If xi is aligned to yj in the optimal
    alignment between x1..M and y1..N, then
  • The alignment between x1..i and y1..j is also
    optimal
  • Easy to prove by contradiction

32
Recursive solution
  • Notice three possible cases
  • xM aligns to yN
  • xM
  • yN
  • 2. xM aligns to a gap
  • xM
  • ?
  • yN aligns to a gap
  • ?
  • yN

m, if xM yN F(M,N)
F(M-1, N-1) -s, if not
F(M,N) F(M-1, N) - d
F(M,N) F(M, N-1) - d
33
Recursive solution
  • Generalize
  • F(i-1, j-1) ?(Xi,Yj)
  • F(i,j) max F(i-1, j) d
  • F(i, j-1) d
  • ?(Xi,Yj) m if Xi Yj, and s otherwise
  • Boundary conditions
  • F(0, 0) 0.
  • F(0, j) ?
  • F(i, 0) ?

-jd y1..j aligned to gaps.
-id x1..i aligned to gaps.
34
What order to fill?
F(0,0)





F(M,N)
i
j
35
What order to fill?
F(0,0)





F(M,N)
36
Example
  • x AGTA m 1
  • y ATA s 1
  • d 1

F(i-1, j-1) ?(Xi,Yj) F(i,j)
max F(i-1, j) d F(i,
j-1) d
F(i,j) i 0 1 2 3 4
A G T A

A
T
A
j 0
1
2
3
37
Example
  • x AGTA m 1
  • y ATA s 1
  • d 1

F(i-1, j-1) ?(Xi,Yj) F(i,j)
max F(i-1, j) d F(i,
j-1) d
F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1
T -2
A -3
j 0
1
2
3
38
Example
  • x AGTA m 1
  • y ATA s 1
  • d 1

F(i-1, j-1) ?(Xi,Yj) F(i,j)
max F(i-1, j) d F(i,
j-1) d
F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2
A -3
j 0
1
2
3
39
Example
  • x AGTA m 1
  • y ATA s 1
  • d 1

F(i-1, j-1) ?(Xi,Yj) F(i,j)
max F(i-1, j) d F(i,
j-1) d
F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2 0 0 1 0
A -3
j 0
1
2
3
40
Example
  • x AGTA m 1
  • y ATA s 1
  • d 1

F(i-1, j-1) ?(Xi,Yj) F(i,j)
max F(i-1, j) d F(i,
j-1) d
F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2 0 0 1 0
A -3 -1 -1 0 2
Optimal Alignment F(4,3) 2
j 0
1
2
3
41
Example
  • x AGTA m 1
  • y ATA s 1
  • d 1

F(i-1, j-1) ?(Xi,Yj) F(i,j)
max F(i-1, j) d F(i,
j-1) d
F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2 0 0 1 0
A -3 -1 -1 0 2
Optimal Alignment F(4,3) 2 This only tells us
the best score
j 0
1
2
3
42
Trace-back
  • x AGTA m 1
  • y ATA s 1
  • d 1

F(i-1, j-1) ?(Xi,Yj) F(i,j)
max F(i-1, j) d F(i,
j-1) d
F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2 0 0 1 0
A -3 -1 -1 0 2
j 0
1
A
A
2
3
43
Trace-back
  • x AGTA m 1
  • y ATA s 1
  • d 1

F(i-1, j-1) ?(Xi,Yj) F(i,j)
max F(i-1, j) d F(i,
j-1) d
F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2 0 0 1 0
A -3 -1 -1 0 2
j 0
1
T A
T A
2
3
44
Trace-back
  • x AGTA m 1
  • y ATA s 1
  • d 1

F(i-1, j-1) ?(Xi,Yj) F(i,j)
max F(i-1, j) d F(i,
j-1) d
F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2 0 0 1 0
A -3 -1 -1 0 2
j 0
1
G T A
- T A
2
3
45
Trace-back
  • x AGTA m 1
  • y ATA s 1
  • d 1

F(i-1, j-1) ?(Xi,Yj) F(i,j)
max F(i-1, j) d F(i,
j-1) d
F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2 0 0 1 0
A -3 -1 -1 0 2
j 0
1
A G T A
A - T A
2
3
46
Trace-back
  • x AGTA m 1
  • y ATA s 1
  • d 1

F(i-1, j-1) ?(Xi,Yj) F(i,j)
max F(i-1, j) d F(i,
j-1) d
F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2 0 0 1 0
A -3 -1 -1 0 2
Optimal Alignment F(4,3) 2 AGTA A?TA
j 0
1
2
3
47
Using trace-back pointers
  • x AGTA m 1
  • y ATA s 1
  • d 1

F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1
T -2
A -3
j 0
1
2
3
48
Using trace-back pointers
  • x AGTA m 1
  • y ATA s 1
  • d 1

F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2
A -3
j 0
1
2
3
49
Using trace-back pointers
  • x AGTA m 1
  • y ATA s 1
  • d 1

F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2 0 0 1 0
A -3
j 0
1
2
3
50
Using trace-back pointers
  • x AGTA m 1
  • y ATA s 1
  • d 1

F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2 0 0 1 0
A -3 -1 -1 0 2
j 0
1
2
3
51
Using trace-back pointers
  • x AGTA m 1
  • y ATA s 1
  • d 1

F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2 0 0 1 0
A -3 -1 -1 0 2
j 0
1
2
3
52
Using trace-back pointers
  • x AGTA m 1
  • y ATA s 1
  • d 1

F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2 0 0 1 0
A -3 -1 -1 0 2
j 0
1
2
3
53
Using trace-back pointers
  • x AGTA m 1
  • y ATA s 1
  • d 1

F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2 0 0 1 0
A -3 -1 -1 0 2
j 0
1
2
3
54
Using trace-back pointers
  • x AGTA m 1
  • y ATA s 1
  • d 1

F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2 0 0 1 0
A -3 -1 -1 0 2
Optimal Alignment F(4,3) 2 AGTA A?TA
j 0
1
2
3
55
The Needleman-Wunsch Algorithm
  • Initialization.
  • F(0, 0) 0
  • F(0, j) - j ? d
  • F(i, 0) - i ? d
  • Main Iteration. Filling in scores
  • For each i 1M
  • For each j 1N
  • F(i-1,j) d case 1
  • F(i, j) max F(i, j-1) d
    case 2
  • F(i-1, j-1) s(xi, yj) case 3
  • UP, if case 1
  • Ptr(i,j) LEFT if case 2
  • DIAG if case 3
  • Termination. F(M, N) is the optimal score, and
    from Ptr(M, N) can trace back optimal alignment

56
Complexity
  • Time
  • O(NM)
  • Space
  • O(NM)
  • Linear-space algorithms do exist (with the same
    time complexity)

57
Equivalent graph problem
S1
G
A
T
A
(0,0)



? a gap in the 2nd sequence ? a gap in the 1st
sequence match / mismatch
1
1
A
S2
1
T
Value on vertical/horizontal line -d Value on
diagonal m or -s
1
1
A
(3,4)
  • Number of steps length of the alignment
  • Path length alignment score
  • Optimal alignment find the longest path from (0,
    0) to (3, 4)
  • General longest path problem cannot be found with
    DP. Longest path on this graph can be found by DP
    since no cycle is possible.

58
Question
  • If we change the scoring scheme, will the optimal
    alignment be changed?
  • Old Match 1, mismatch gap -1
  • New match 2, mismatch gap 0
  • New Match 2, mismatch gap -2?

59
Question
  • What kind of alignment is represented by these
    paths?

A
A
A
A
A










B C
B C
B C
B C
B C
A- BC
A-- -BC
--A BC-
-A- B-C
-A BC
Alternating gaps are impossible if s gt -2d
60
A variant of the basic algorithm
  • Scoring scheme m s d 1
  • Seq1 CAGCA-CTTGGATTCTCGG
  • Seq2 ---CAGCGTGG--------
  • Seq1 CAGCACTTGGATTCTCGG
  • Seq2 CAGC-----G-T----GG
  • The first alignment may be biologically more
    realistic in some cases (e.g. if we know s2 is a
    subsequence of s1)

Score -7
Score -2
61
A variant of the basic algorithm
  • Maybe it is OK to have an unlimited of gaps in
    the beginning and end

----------CTATCACCTGACCTCCAGGCCGATGCCCCTTCCGGC GCG
AGTTCATCTATCAC--GACCGC--GGTCG--------------
  • Then, we dont want to penalize gaps in the ends

62
The Overlap Detection variant
  • Changes
  • Initialization
  • For all i, j,
  • F(i, 0) 0
  • F(0, j) 0
  • Termination
  • maxi F(i, N)
  • FOPT max maxj F(M, j)

x1 xM
yN y1
63
Different types of overlaps
x
x
y
y
64
The local alignment problem
  • Given two strings X x1xM,
  • Y y1yN
  • Find substrings x, y whose similarity (optimal
    global alignment value) is maximum
  • e.g. X abcxdex X cxde
  • Y xxxcde Y c-de

x
y
65
Why local alignment
  • Conserved regions may be a small part of the
    whole
  • Global alignment might miss them if flanking
    junk outweighs similar regions
  • Genes are shuffled between genomes

C
D
B
A
D
A
B
C
66
Naïve algorithm
  • for all substrings X of X and Y of Y
  • Align X Y via dynamic programming
  • Retain pair with max value
  • end
  • Output the retained pair
  • Time O(n2) choices for A, O(m2) for B, O(nm) for
    DP, so O(n3m3 ) total.

67
Reminder
  • The overlap detection algorithm
  • We do not give penalty to gaps at either end

Free gap
Free gap
68
The local alignment idea
  • Do not penalize the unaligned regions (gaps or
    mismatches)
  • The alignment can start anywhere and ends
    anywhere
  • Strategy whenever we get to some low similarity
    region (negative score), we restart a new
    alignment
  • By resetting alignment score to zero

69
The Smith-Waterman algorithm
  • Initialization F(0, j) F(i, 0) 0
  • 0
  • F(i 1, j) d
  • F(i, j 1) d
  • F(i 1, j 1) ?(xi, yj)

Iteration F(i, j) max
70
The Smith-Waterman algorithm
  • Termination
  • If we want the best local alignment
  • FOPT maxi,j F(i, j)
  • If we want all local alignments scoring gt t
  • For all i, j find F(i, j) gt t, and trace back

71
x x x c d e
0 0 0 0 0 0 0
a 0
b 0
c 0
x 0
d 0
e 0
x 0
Match 2 Mismatch -1 Gap -1
72
x x x c d e
0 0 0 0 0 0 0
a 0 0 0 0 0 0 0
b 0 0 0 0 0 0 0
c 0
x 0
d 0
e 0
x 0
Match 2 Mismatch -1 Gap -1
73
x x x c d e
0 0 0 0 0 0 0
a 0 0 0 0 0 0 0
b 0 0 0 0 0 0 0
c 0 0 0 0 2 1 0
x 0
d 0
e 0
x 0
Match 2 Mismatch -1 Gap -1
74
x x x c d e
0 0 0 0 0 0 0
a 0 0 0 0 0 0 0
b 0 0 0 0 0 0 0
c 0 0 0 0 2 1 0
x 0 2 2 2 1 0 0
d 0
e 0
x 0
Match 2 Mismatch -1 Gap -1
75
x x x c d e
0 0 0 0 0 0 0
a 0 0 0 0 0 0 0
b 0 0 0 0 0 0 0
c 0 0 0 0 2 1 0
x 0 2 2 2 1 0 0
d 0 1 1 1 1 3 2
e 0
x 0
Match 2 Mismatch -1 Gap -1
76
x x x c d e
0 0 0 0 0 0 0
a 0 0 0 0 0 0 0
b 0 0 0 0 0 0 0
c 0 0 0 0 2 1 0
x 0 2 2 2 1 0 0
d 0 1 1 1 1 3 2
e 0 0 0 0 0 2 5
x 0
Match 2 Mismatch -1 Gap -1
77
x x x c d e
0 0 0 0 0 0 0
a 0 0 0 0 0 0 0
b 0 0 0 0 0 0 0
c 0 0 0 0 2 1 0
x 0 2 2 2 1 1 0
d 0 1 1 1 1 3 2
e 0 0 0 0 0 2 5
x 0 2 2 2 1 1 4
Match 2 Mismatch -1 Gap -1
78
Trace back
x x x c d e
0 0 0 0 0 0 0
a 0 0 0 0 0 0 0
b 0 0 0 0 0 0 0
c 0 0 0 0 2 1 0
x 0 2 2 2 1 1 0
d 0 1 1 1 1 3 2
e 0 0 0 0 0 2 5
x 0 2 2 2 1 1 4
Match 2 Mismatch -1 Gap -1
79
Trace back
x x x c d e
0 0 0 0 0 0 0
a 0 0 0 0 0 0 0
b 0 0 0 0 0 0 0
c 0 0 0 0 2 1 0
x 0 2 2 2 1 1 0
d 0 1 1 1 1 3 2
e 0 0 0 0 0 2 5
x 0 2 2 2 1 1 4
Match 2 Mismatch -1 Gap -1
cxde c-de
x-de xcde
80
  • No negative values in local alignment DP array
  • Optimal local alignment will never have a gap on
    either end
  • Local alignment Smith-Waterman
  • Global alignment Needleman-Wunsch

81
Analysis
  • Time
  • O(MN) for finding the best alignment
  • Time to report all alignments depends on the
    number of sub-opt alignments
  • Memory
  • O(MN)
  • O(MN) possible

82
  • More efficient alignment algorithms

83
  • Given two sequences of length M, N
  • Time O(MN)
  • Ok, but still slow for long sequences
  • Space O(MN)
  • bad
  • 1Mb seq x 1Mb seq 1TB memory
  • Can we do better?

84
Bounded alignment
  • Good alignment should appear near the diagonal





















85
Bounded Dynamic Programming
  • If we know that x and y are very similar
  • Assumption gaps(x, y) lt k
  • xi
  • Then, implies i j lt k
  • yj

86
Bounded Dynamic Programming
  • Initialization
  • F(i,0), F(0,j) undefined for i, j gt k
  • Iteration
  • For i 1M
  • For j max(1, i k)min(N, ik)
  • F(i 1, j 1) ?(xi, yj)
  • F(i, j) max F(i, j 1) d, if j gt i k
  • F(i 1, j) d, if j lt i k
  • Termination same

x1 xM
yN y1
k
87
Analysis
  • Time O(kM) ltlt O(MN)
  • Space O(kM) with some tricks

gt
M
M
2k
2k
88




(No Transcript)
89
  • Given two sequences of length M, N
  • Time O(MN)
  • ok
  • Space O(MN)
  • bad
  • 1mb seq x 1mb seq 1TB memory
  • Can we do better?

90
Linear space algorithm
  • If all we need is the alignment score but not the
    alignment, easy!

We only need to keep two rows (You only need one
row, with a little trick)










But how do we get the alignment?
91
Linear space algorithm
  • When we finish, we know how we have aligned the
    ends of the sequences

XM YN










Naïve idea Repeat on the smaller subproblem
F(M-1, N-1) Time complexity O((MN)(MN))
92
(0, 0)
M/2
(M, N)
Key observation optimal alignment (longest path)
must use an intermediate point on the M/2-th row.
Call it (M/2, k), where k is unknown.
93
(0,0)






(3,2)
(3,4)
(3,6)
(3,0)
(6,6)
  • Longest path from (0, 0) to (6, 6) is max_k
    (LP(0,0,3,k) LP(6,6,3,k))

94
Hirschbergs idea
  • Divide and conquer!

Y










Forward algorithm Align x1x2xM/2 with Y
X
M/2
F(M/2, k) represents the best alignment between
x1x2xM/2 and y1y2yk
95
Backward Algorithm
Y










Backward algorithm Align reverse(xM/21xM) with
reverse(Y)
X
M/2
B(M/2, k) represents the best alignment between
reverse(xM/21xM) and reverse(ykyk1yN )
96
Linear-space alignment
  • Using 2 (4) rows of space, we can compute
  • for k 1N, F(M/2, k), B(M/2, k)





















M/2
97
Linear-space alignment
  • Now, we can find k maximizing F(M/2, k) B(M/2,
    k)
  • Also, we can trace the path exiting column M/2
    from k











Conclusion In O(NM) time, O(N) space, we found
optimal alignment path at row M/2
98
Linear-space alignment
  • Iterate this procedure to the two sub-problems!

M/2
k
M/2
N-k
99
Analysis
  • Memory O(N) for computation, O(NM) to store the
    optimal alignment
  • Time
  • MN for first iteration
  • k M/2 (N-k) M/2 MN/2 for second

k
M/2
M/2
N-k
100
MN
MN/2
MN/4
MN MN/2 MN/4 MN/8 MN (1 ½ ¼
1/8 1/16 ) 2MN O(MN)
MN/8
101
Outline
  • Part I Algorithms
  • Biological problem
  • Intro to dynamic programming
  • Global sequence alignment
  • Local sequence alignment
  • More efficient algorithms
  • Part II Biological issues
  • Model gaps more accurately
  • Alignment statistics
  • Part III BLAST

102
  • How to model gaps more accurately

103
Whats a better alignment?
  • GACGCCGAACG
  • GACGC---ACG
  • GACGCCGAACG
  • GACG-C-A-CG

Score 8 x m 3 x d
Score 8 x m 3 x d
  • However, gaps usually occur in bunches.
  • During evolution, chunks of DNA may be lost
    entirely
  • Aligning genomic sequences vs. cDNAs (reverse
    complimentary to mRNAs)

104
Model gaps more accurately
  • Current model
  • Gap of length n incurs penalty n?d
  • General
  • Convex function
  • E.g. ?(n) c sqrt (n)

?
n
?
n
105
General gap dynamic programming





  • Initialization same
  • Iteration
  • F(i-1, j-1) s(xi, yj)
  • F(i, j) max maxk0i-1F(k,j) ?(i-k)
  • maxk0j-1F(i,k) ?(j-k)
  • Termination same
  • Running Time O((MN)MN) (cubic)
  • Space O(NM) (linear-space algorithm not
    applicable)

106
Compromise affine gaps
?(n)
  • ?(n) d (n 1)?e
  • gap gap
  • open extension

e
d
Match 2 Gap open -5 Gap extension -1
GACGCCGAACG GACGC---ACG
GACGCCGAACG GACG-C-A-CG
8x2-5-2 9
8x2-3x5 1
  • We want to find the optimal alignment with affine
    gap penalty in
  • O(MN) time
  • O(MN) or better O(MN) memory

107
Allowing affine gap penalties
  • Still three cases
  • xi aligned with yj
  • xi aligned to a gap
  • Are we continuing a gap in x? (if no, start is
    more expensive)
  • yj aligned to a gap
  • Are we continuing a gap in y? (if no, start is
    more expensive)
  • We can use a finite state machine to represent
    the three cases as three states
  • The machine has two heads, reading the chars on
    the two strings separately
  • At every step, each head reads 0 or 1 char from
    each sequence
  • Depending on what it reads, goes to a different
    state, and produces different scores

108
Finite State Machine
Input
Output
? / ?
? / ?
Ix
? / ?
? / ?
F
? / ?
Iy
? / ?
State
? / ?
F have just read 1 char from each seq (xi
aligned to yj ) Ix have read 0 char from x. (yj
aligned to a gap) Iy have read 0 char from y (xi
aligned to a gap)
109
Input
Output
(-, yj) / e
Ix
(xi,yj) / ?
(xi,yj) / ?
(-, yj) / d
F
(xi,-) / d
Iy
(xi,-) / e
Start state
(xi,yj) / ?
Current state Input Output Next state
F (xi,yj) ? F
F (-,yj) d Ix
F (xi,-) d Iy
Ix (-,yj) e Ix

110
F-F-Iy-F-Ix
F-F-F-F
F-Iy-F-F-Ix
AAC ACT
AAC- A-CT
AAC- -ACT
AAC ACT
Given a pair of sequences, an alignment (not
necessarily optimal) corresponds to a state path
in the FSM. Optimal alignment find a state path
to read the two sequences such that the total
output score is the highest
111
Dynamic programming
  • We encode this information in three different
    matrices
  • For each element (i,j) we use three variables
  • F(i,j) best alignment (score) of x1..xi y1..yj
    if xi aligns to yj
  • Ix(i,j) best alignment of x1..xi y1..yj if yj
    aligns to gap
  • Iy(i,j) best alignment of x1..xi y1..yj if xi
    aligns to gap

xi
xi
xi
yj
yj
yj
Iy(i, j)
Ix(i, j)
F(i, j)
112
(-, yj)/e
Ix
(xi,yj) /?
(xi,yj) /?
(-, yj) /d
F
(xi,-) /d
Iy
(xi,-)/e
(xi,yj) /?
F(i-1, j-1) ?(xi, yj) F(i, j) max Ix(i-1,
j-1) ?(xi, yj) Iy(i-1, j-1) ?(xi, yj)
xi
yj
113
(-, yj)/e
Ix
(xi,yj) /?
(xi,yj) /?
(-, yj) /d
F
(xi,-) /d
Iy
(xi,-)/e
(xi,yj) /?
F(i, j-1) d Ix(i, j) max Ix(i, j-1)
e
xi
yj
Ix(i, j)
114
(-, yj)/e
Ix
(xi,yj) /?
(xi,yj) /?
(-, yj) /d
F
(xi,-) /d
Iy
(xi,-)/e
(xi,yj) /?
F(i-1, j) d Iy(i, j) max Iy(i-1, j)
e
xi
yj
Iy(i, j)
115
  • F(i 1, j 1)
  • F(i, j) ?(xi, yj) max Ix(i 1, j 1)
  • Iy(i 1, j 1)
  • F(i, j 1) d
  • Ix(i, j) max
  • Ix(i, j 1) e
  • F(i 1, j) d
  • Iy(i, j) max
  • Iy(i 1, j) e

Continuing alignment
Closing gaps in x
Closing gaps in y
Opening a gap in x
Gap extension in x
Opening a gap in y
Gap extension in y
116
Data dependency






F
i
Iy
Ix
j












i-1
i-1
j-1
j-1
117
Data dependency






F
i
Iy
Ix
j












i
i
j
j
118
Data dependency
  • If we stack all three matrices
  • No cyclic dependency
  • Therefore, we can fill in all three matrices in
    order



















119
Algorithm
  • for i 1m
  • for j 1n
  • Fill in F(i, j), Ix(i, j), Iy(i, j)
  • end
  • end
  • F(M, N) max (F(M, N), Ix(M, N), Iy(M, N))
  • Time O(MN)
  • Space O(MN) or O(N) when combined with the
    linear-space algorithm

120
Exercise
  • x GCAC
  • y GCC
  • m 2
  • s -2
  • d -5
  • e -1

121
y
y
G C C
G C C
0 -? -? -?
-?
-?
-?
-?
x
x
-? -? -? -?
-5
-6
-7
-8
m 2 s -2 d -5 e -1
G C A C
G C A C
F aligned on both
Iy Insertion on y
G C C
y
F(i-1, j-1)
Iy(i-1, j-1)
-? -5 -6 -7
-?
-?
-?
-?
Iy(i-1,j)
x
F(i-1,j)
?(xi, yj)
G C A C
e
Ix(i-1, j-1)
d
F(i, j)
F(i,j-1)
Iy(i,j)
d
Ix(i,j)
Ix(i,j-1)
e
Ix Insertion on x
122
y
y
G C C
G C C
0 -? -? -?
-? 2
-?
-?
-?
x
x
-? -? -? -?
-5
-6
-7
-8
m 2 s -2 d -5 e -1
G C A C
G C A C
F
Iy
G C C
y
-? -5 -6 -7
-?
-?
-?
-?
x
F(i-1, j-1)
Iy(i-1, j-1)
G C A C
?(xi, yj) 2
Ix(i-1, j-1)
F(i, j)
Ix
123
y
y
G C C
G C C
0 -? -? -?
-? 2 -7
-?
-?
-?
x
x
-? -? -? -?
-5
-6
-7
-8
m 2 s -2 d -5 e -1
G C A C
G C A C
F
Iy
G C C
y
-? -5 -6 -7
-?
-?
-?
-?
x
F(i-1, j-1)
Iy(i-1, j-1)
G C A C
?(xi, yj) -2
Ix(i-1, j-1)
F(i, j)
Ix
124
y
y
G C C
G C C
0 -? -? -?
-? 2 -7 -8
-?
-?
-?
x
x
-? -? -? -?
-5
-6
-7
-8
m 2 s -2 d -5 e -1
G C A C
G C A C
F
Iy
G C C
y
-? -5 -6 -7
-?
-?
-?
-?
x
F(i-1, j-1)
Iy(i-1, j-1)
G C A C
?(xi, yj) -2
Ix(i-1, j-1)
F(i, j)
Ix
125
y
y
G C C
G C C
0 -? -? -?
-? 2 -7 -8
-?
-?
-?
x
x
-? -? -?
-5
-6
-7
-8
m 2 s -2 d -5 e -1
G C A C
G C A C
F
Iy
y
G C C
x
-5 -6 -7
-? -? -3
-?
-?
-?
G C A C
F(i,j-1)
d -5
Ix(i,j)
Ix(i,j-1)
e -1
Ix
126
y
y
G C C
G C C
0 -? -? -?
-? 2 -7 -8
-?
-?
-?
x
x
-? -? -?
-5
-6
-7
-8
m 2 s -2 d -5 e -1
G C A C
G C A C
F
Iy
y
G C C
x
-5 -6 -7
-? -? -3 -4
-?
-?
-?
G C A C
F(i,j-1)
d -5
Ix(i,j)
Ix(i,j-1)
e -1
Ix
127
y
y
G C C
G C C
0 -? -? -?
-? 2 -7 -8
-?
-?
-?
x
x
-? -? -?
-5 -? -? -?
-6
-7
-8
m 2 s -2 d -5 e -1
G C A C
G C A C
F
Iy
y
G C C
x
-5 -6 -7
-? -? -3 -4
-?
-?
-?
Iy(i-1,j)
G C A C
F(i-1,j)
e-1
d-5
Iy(i,j)
Ix
128
y
y
G C C
G C C
0 -? -? -?
-? 2 -7 -8
-? -7
-?
-?
x
x
-? -? -?
-5 -? -? -?
-6
-7
-8
m 2 s -2 d -5 e -1
G C A C
G C A C
F
Iy
y
G C C
x
-5 -6 -7
-? -? -3 -4
-?
-?
-?
F(i-1, j-1)
Iy(i-1, j-1)
G C A C
?(xi, yj) -2
Ix(i-1, j-1)
F(i, j)
Ix
129
y
y
G C C
G C C
0 -? -? -?
-? 2 -7 -8
-? -7 4
-?
-?
x
x
-? -? -?
-5 -? -? -?
-6
-7
-8
m 2 s -2 d -5 e -1
G C A C
G C A C
F
Iy
y
G C C
x
-5 -6 -7
-? -? -3 -4
-?
-?
-?
F(i-1, j-1)
Iy(i-1, j-1)
G C A C
?(xi, yj) 2
Ix(i-1, j-1)
F(i, j)
Ix
130
y
y
G C C
G C C
0 -? -? -?
-? 2 -7 -8
-? -7 4 -1
-?
-?
x
x
-? -? -?
-5 -? -? -?
-6
-7
-8
m 2 s -2 d -5 e -1
G C A C
G C A C
F
Iy
y
G C C
x
-5 -6 -7
-? -? -3 -4
-?
-?
-?
F(i-1, j-1)
Iy(i-1, j-1)
G C A C
?(xi, yj) 2
Ix(i-1, j-1)
F(i, j)
Ix
131
y
y
G C C
G C C
0 -? -? -?
-? 2 -7 -8
-? -7 4 -1
-?
-?
x
x
-? -? -?
-5 -? -? -?
-6
-7
-8
m 2 s -2 d -5 e -1
G C A C
G C A C
F
Iy
y
G C C
x
-5 -6 -7
-? -? -3 -4
-? -? -12 -1
-?
-?
G C A C
F(i,j-1)
d -5
Ix(i,j)
Ix(i,j-1)
e -1
Ix
132
y
y
G C C
G C C
0 -? -? -?
-? 2 -7 -8
-? -7 4 -1
-?
-?
x
x
-? -? -?
-5 -? -? -?
-6 -3
-7
-8
m 2 s -2 d -5 e -1
G C A C
G C A C
F
Iy
y
G C C
x
-5 -6 -7
-? -? -3 -4
-? -? -12 -1
-?
-?
Iy(i-1,j)
G C A C
F(i-1,j)
e-1
d-5
Iy(i,j)
Ix
133
y
y
G C C
G C C
0 -? -? -?
-? 2 -7 -8
-? -7 4 -1
-?
-?
x
x
-? -? -?
-5 -? -? -?
-6 -3 -12 -13
-7
-8
m 2 s -2 d -5 e -1
G C A C
G C A C
F
Iy
y
G C C
F(i-1, j-1)
Iy(i-1, j-1)
Iy(i-1,j)
x
-5 -6 -7
-? -? -3 -4
-? -? -12 -1
-?
-?
F(i-1,j)
?(xi, yj)
G C A C
e
Ix(i-1, j-1)
d
F(i, j)
F(i,j-1)
Iy(i,j)
d
Ix(i,j)
Ix(i,j-1)
e
Ix
134
y
y
G C C
G C C
0 -? -? -?
-? 2 -7 -8
-? -7 4 -5
-? -8 -5 2
-?
x
x
-? -? -?
-5 -? -? -?
-6 -3 -12 -13
-7
-8
m 2 s -2 d -5 e -1
G C A C
G C A C
F
Iy
y
G C C
F(i-1, j-1)
Iy(i-1, j-1)
Iy(i-1,j)
x
-5 -6 -7
-? -? -3 -4
-? -? -12 -1
-? -? -13 -10
-?
F(i-1,j)
?(xi, yj)
G C A C
e
Ix(i-1, j-1)
d
F(i, j)
F(i,j-1)
Iy(i,j)
d
Ix(i,j)
Ix(i,j-1)
e
Ix
135
y
y
G C C
G C C
0 -? -? -?
-? 2 -7 -8
-? -7 4 -1
-? -8 -5 2
-?
x
x
-? -? -?
-5 -? -? -?
-6 -3 -12 -13
-7 -8 -1
-8
m 2 s -2 d -5 e -1
G C A C
G C A C
F
Iy
y
G C C
x
-5 -6 -7
-? -? -3 -4
-? -? -12 -1
-? -? -13 -10
-?
Iy(i-1,j)
G C A C
F(i-1,j)
e-1
d-5
Iy(i,j)
Ix
136
y
y
G C C
G C C
0 -? -? -?
-? 2 -7 -8
-? -7 4 -1
-? -8 -5 2
-?
x
x
-? -? -?
-5 -? -? -?
-6 -3 -12 -13
-7 -8 -1 -6
-8
m 2 s -2 d -5 e -1
G C A C
G C A C
F
Iy
y
G C C
x
-5 -6 -7
-? -? -3 -4
-? -? -12 -1
-? -? -13 -10
-?
Iy(i-1,j)
G C A C
F(i-1,j)
e-1
d-5
Iy(i,j)
Ix
137
y
y
G C C
G C C
0 -? -? -?
-? 2 -7 -8
-? -7 4 -1
-? -8 -5 2
-? -9 -6 1
x
x
-? -? -?
-5 -? -? -?
-6 -3 -12 -13
-7 -8 -1 -6
-8 -13 -2 -3
m 2 s -2 d -5 e -1
G C A C
G C A C
F
Iy
y
G C C
F(i-1, j-1)
Iy(i-1, j-1)
Iy(i-1,j)
x
-5 -6 -7
-? -? -3 -4
-? -? -12 -1
-? -? -13 -10
-? -? -14 -11
F(i-1,j)
?(xi, yj)
G C A C
e
Ix(i-1, j-1)
d
F(i, j)
F(i,j-1)
Iy(i,j)
d
Ix(i,j)
Ix(i,j-1)
e
Ix
138
y
y
G C C
G C C
0 -? -? -?
-? 2 -7 -8
-? -7 4 -1
-? -8 -5 2
-? -9 -6 1
x
x
-? -? -?
-5 -? -? -?
-6 -3 -12 -13
-7 -8 -1 -6
-8 -13 -2 -3
m 2 s -2 d -5 e -1
G C A C
G C A C
F
Iy
y
G C C
F(i-1, j-1)
Iy(i-1, j-1)
Iy(i-1,j)
x
-5 -6 -7
-? -? -3 -4
-? -? -12 -1
-? -? -13 -10
-? -? -14 -11
F(i-1,j)
?(xi, yj)
G C A C
e
Ix(i-1, j-1)
d
F(i, j)
F(i,j-1)
Iy(i,j)
d
Ix(i,j)
Ix(i,j-1)
e
Ix
139
y
y
G C C
G C C
0 -? -? -?
-? 2 -7 -8
-? -7 4 -1
-? -8 -5 2
-? -9 -6 1
x
x
-? -? -?
-5 -? -? -?
-6 -3 -12 -13
-7 -8 -1 -6
-8 -13 -2 -3
m 2 s -2 d -5 e -1
G C A C
G C A C
GCAC GC-C
x
y
F
Iy
y
G C C
y
G C C
x
-5 -6 -7
-? -? -3 -4
-? -? -12 -1
-? -? -13 -10
-? -? -14 -11










x
G C A C
G C A C
Ix
140
Statistics of alignment
  • Where does ?(xi, yj) come from?
  • Are two aligned sequences actually related?

141
Probabilistic model of alignments
  • Well first focus on protein alignments without
    gaps
  • Given an alignment, we can consider two possible
    models
  • R the sequences are related by evolution
  • U the sequences are unrelated
  • How can we distinguish these two models?
  • How is this view related to amino-acid
    substitution matrix?

142
Model for unrelated sequences
  • Assume each position of the alignment is
    independently sampled from some distribution of
    amino acids
  • ps probability of amino acid s in the sequences
  • Probability of seeing an amino acid s aligned to
    an amino acid t by chance is
  • Pr(s, t U) ps pt
  • Probability of seeing an ungapped alignment
    between x x1xn and y y1yn randomly is

i
143
Model for related sequences
  • Assume each pair of aligned amino acids evolved
    from a common ancestor
  • Let qst be the probability that amino acid s in
    one sequence is related to t in another sequence
  • The probability of an alignment of x and y is
    give by

144
Probabilistic model of Alignments
  • How can we decide which model (U or R) is more
    likely?
  • One principled way is to consider the relative
    likelihood of the two models (the odds ratio)
  • A higher ratio means that R is more likely than U

145
Log odds ratio
  • Taking logarithm, we get
  • Recall that the score of an alignment is given by



146
  • Therefore, if we define
  • We are actually defining the alignment score as
    the log odds ratio between the two models R and U

147
How to get the probabilities?
  • ps can be counted from the available protein
    sequences
  • But how do we get qst? (the probability that s
    and t have a common ancestor)
  • Counted from trusted alignments of related
    sequences

148
Protein Substitution Matrices
  • Two popular sets of matrices for protein
    sequences
  • PAM matrices Dayhoff et al, 1978
  • Better for aligning closely related sequences
  • BLOSUM matrices Henikoff Henikoff, 1992
  • For both closely or remotely related sequences

149
BLOSUM-N matrices
  • Constructed from a database called BLOCKS
  • Contain many closely related sequences
  • Conserved amino acids may be over-counted
  • N 62 the probabilities qst were computed using
    trusted alignments with no more than 62 identity
  • identity of matched columns
  • Using this matrix, the Smith-Waterman algorithm
    is most effective in detecting real alignments
    with a similar identity level (i.e. 62)

150
? Scaling factor to convert score to
integer. Important when you are told that
a scoring matrix is in half-bits gt ? ½ ln2
Positive for chemically similar substitution
Common amino acids get low weights
Rare amino acids get high weights
151
BLOSUM-N matrices
  • If you want to detect homologous genes with high
    identity, you may want a BLOSUM matrix with
    higher N. say BLOSUM75
  • On the other hand, if you want to detect remote
    homology, you may want to use lower N, say
    BLOSUM50
  • BLOSUM-62 good for most purposes

152
For DNAs
  • No database of trusted alignments to start with
  • Specify the percentage identity you would like to
    detect
  • You can then get the substitution matrix by some
    calculation

153
For example
  • Suppose pA pC pT pG 0.25
  • We want 88 identity
  • qAA qCC qTT qGG 0.22, the rest 0.12/12
    0.01
  • ?(A, A) ?(C, C) ?(G, G) ?(T, T)
  • log (0.22 / (0.250.25)) 1.26
  • ?(s, t) log (0.01 / (0.250.25)) -1.83 for s
    ? t.

154
Substitution matrix
A C G T
A 1.26 -1.83 -1.83 -1.83
C -1.83 1.26 -1.83 -1.83
G -1.83 -1.83 1.26 -1.83
T -1.83 -1.83 -1.83 1.26
155
A C G T
A 5 -7 -7 -7
C -7 5 -7 -7
G -7 -7 5 -7
T -7 -7 -7 5
  • Scale wont change the alignment
  • Multiply by 4 and then round off to get integers

156
Arbitrary substitution matrix
  • Say you have a substitution matrix provided by
    someone
  • Its important to know what you are actually
    looking for when you use the matrix

157
NCBI-BLAST
WU-BLAST
A C G T
A 1 -2 -2 -2
C -2 1 -2 -2
G -2 -2 1 -2
T -2 -2 -2 1
A C G T
A 5 -4 -4 -4
C -4 5 -4 -4
G -4 -4 5 -4
T -4 -4 -4 5
  • Whats the difference?
  • Which one should I use for my sequences?

158
  • We had
  • Scale it, so that
  • Reorganize

159
  • Since all probabilities must sum to 1,
  • We have
  • Suppose again ps 0.25 for any s
  • We know ?(s, t) from the substitution matrix
  • We can solve the equation for ?
  • Plug ? into to
    get qst

160
NCBI-BLAST
WU-BLAST
A C G T
A 1 -2 -2 -2
C -2 1 -2 -2
G -2 -2 1 -2
T -2 -2 -2 1
A C G T
A 5 -4 -4 -4
C -4 5 -4 -4
G -4 -4 5 -4
T -4 -4 -4 5
? 1.33 qst 0.24 for s t, and 0.004 for s ?
t Translate 95 identity
? 0.19 qst 0.16 for s t, and 0.03 for s ?
t Translate 65 identity
161
Details for solving ?
  • Known ?(s,t) 1 for st, and ?(s,t) -2 for
    s?? t
Write a Comment
User Comments (0)
About PowerShow.com