
1
Sequence Alignment
  • Arthur Chou
  • Clark University
  • Spring 2005

2
Sequence Alignment
  • Input two sequences over the same alphabet
  • Output an alignment of the two sequences
  • Example
  • GCGCATGGATTGAGCGA
  • TGCGCCATTGATGACCA
  • A possible alignment
  • -GCGC-ATGGATTGAGCGA
  • TGCGCCATTGAT-GACC-A

3
Why align sequences?
  • Lots of sequences don't have known ancestry,
    structure, or function. A few of them do.
  • If they align, they are similar.
  • If they are similar, they might have the same
    ancestry, similar structure or function.
  • If one of them has known ancestry, structure,
    or function, then alignment to the others yields
    insight about them.

4
Alignments
  • -GCGC-ATGGATTGAGCGA
  • TGCGCCATTGAT-GACC-A
  • Three kinds of match
  • Exact matches
  • Mismatches
  • Indels (gaps)

5
Choosing Alignments
  • There are many possible alignments
  • For example, compare
  • -GCGC-ATGGATTGAGCGA
  • TGCGCCATTGAT-GACC-A
  • to
  • ------GCGCATGGATTGAGCGA
  • TGCGCC----ATTGATGACCA--
  • Which one is better?

6
Scoring Alignments
  • Similar sequences evolved from a common ancestor
  • Evolution changed the sequences from this
    ancestral sequence by mutations
  • Replacement: one letter replaced by another
  • Deletion: deletion of a character
  • Insertion: insertion of a character
  • Scoring of sequence similarity should examine how
    many and which operations took place

7
Simple Scoring Rule
  • Score each position independently
  • Match: +1
  • Mismatch: -1
  • Indel: -2
  • Score of an alignment is sum of position scores

8
Example
  • -GCGC-ATGGATTGAGCGA
  • TGCGCCATTGAT-GACC-A
  • Score: (1 × 13) + (-1 × 2) + (-2 × 4) = 3
  • ------GCGCATGGATTGAGCGA
  • TGCGCC----ATTGATGACCA--
  • Score: (1 × 5) + (-1 × 6) + (-2 × 12) = -25
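
Counting columns by hand is error-prone, so the simple scoring rule can be checked mechanically. The sketch below is a minimal illustration, not part of the original slides; the function name and keyword parameters are hypothetical. It scores the two alignments shown on this slide column by column.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1, indel=-2):
    """Sum per-column scores for two equal-length aligned rows ('-' marks a gap)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total += indel        # indel column
        elif x == y:
            total += match        # exact match
        else:
            total += mismatch     # mismatch
    return total

# First alignment: 13 matches, 2 mismatches, 4 indel columns -> 3
first = score_alignment("-GCGC-ATGGATTGAGCGA", "TGCGCCATTGAT-GACC-A")
# Second alignment as written: 5 matches, 6 mismatches, 12 indel columns -> -25
second = score_alignment("------GCGCATGGATTGAGCGA", "TGCGCC----ATTGATGACCA--")
print(first, second)
```

Under this rule the first alignment is clearly preferred, which answers the "which one is better?" question for these particular scores.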

9
More General Scores
  • The choice of +1, -1, and -2 scores is quite
    arbitrary
  • Depending on the context, some changes are more
    plausible than others
  • Exchange of an amino acid by one with similar
    properties (size, charge, etc.) vs.
  • Exchange of an amino acid by one with opposite
    properties
  • Probabilistic interpretation: how likely is one
    alignment versus another?

10
Dot Matrix Method
  • A dot is placed at each position where two
    residues match.
  • It's a visual aid. The human eye can rapidly
    identify similar regions in sequences.
  • It's a good way to explore sequence organization
    e.g. sequence repeats.
  • It does not provide an alignment.

  • Example
  • THEFA-TCAT
  • THEFASTCAT
  • This method produces dot-plots with too much
    noise to be useful
  • The noise can be reduced by calculating a score
    using a window of residues.
  • The score is compared to a threshold or
    stringency.

11
Dot Matrix Representation
  • Produces a graphical representation of similarity
    regions
  • The horizontal and vertical dimensions correspond
    to the compared sequences
  • A region of similarity stands out as a diagonal

12
Dot Matrix or Dot-plot
  • Each window of the first sequence is aligned
    (without gaps) to each window of the 2nd sequence
  • A colour is set into a rectangular array
    according to the score of the aligned windows

13
Dot Matrix Display
  • Diagonal rows of dots reveal sequence
    similarity or repeats.
  • Anti-diagonal rows of dots represent
    inverted repeats.
  • Isolated dots represent random similarity.

[Figure: example dot plot of two amino-acid sequences]
14
We can filter the dot plot with a sliding window
that looks for longer runs of matches and
eliminates random matches
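
The windowed filter can be sketched in a few lines. This is a minimal illustration under assumptions: the function names, the `window` and `threshold` parameters, and the text rendering are all hypothetical, and the input is the THEFATCAT / THEFASTCAT pair from the dot matrix slide with the gap removed.

```python
def dot_plot(seq_a, seq_b, window=1, threshold=1):
    """Return a grid of booleans: True where the window of seq_a starting at i
    and the window of seq_b starting at j share at least `threshold` matches."""
    rows = len(seq_a) - window + 1
    cols = len(seq_b) - window + 1
    return [
        [
            sum(seq_a[i + k] == seq_b[j + k] for k in range(window)) >= threshold
            for j in range(cols)
        ]
        for i in range(rows)
    ]

def render(grid):
    """Draw the grid as text: '*' for a dot, '.' for an empty cell."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in grid)

plain = dot_plot("THEFATCAT", "THEFASTCAT")                            # one dot per matching residue
filtered = dot_plot("THEFATCAT", "THEFASTCAT", window=3, threshold=3)  # only runs of 3 matches
print(render(plain))
print()
print(render(filtered))
```

With a window of 3 and a threshold of 3, the isolated dots disappear and only the two diagonal runs on either side of the extra S remain.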
15
Longest Common Subsequence
  • Sequence A: nematode_knowledge
  • Sequence B: empty_bottle
  • n e m a t o d e _ k n o w l e d g e
  • e m p t y _ b o t t l e
  • LCS: alignment with match score 1,
    mismatch score 0, and gap penalty 0

16
What is an algorithm?
  • A step-by-step description of the procedures to
    accomplish a task.
  • Properties
  • Determination of output for each input
  • Generality
  • Termination
  • Criteria
  • Correctness (proof, test, etc.)
  • Time efficiency (number of steps is small)
  • Space efficiency (space used is small)

17
Naïve algorithm: exhaustive search
  • G C G A A T G G A T T G A G C G T
  • T G A G C C A T T G A T G A C C A

Two sequences of length n, scanned with indices i and j
Each candidate is a string of 2n moves, each advancing i or j:
i j j i j i j j i i . . .
Worst-case time complexity is 2^(2n)
18
Dynamic programming algorithms for pairwise
sequence alignment
  • Similar to Longest Common Subsequence
  • Introduced for biological sequences by
  • S. B. Needleman & C. D. Wunsch. A general method
    applicable to the search for similarities in the
    amino acid sequence of two proteins. J. Mol.
    Biol. 48:443-453 (1970)

19
Dynamic Programming
  • Optimal substructure
  • Reduction to a small number of sub-problems
  • Memoization of solutions to sub-problems in a
    table
  • Table look-up and tracing

- G C G C A T G G A T T G A G C G A
T G C G C C A T T G A T G A C C - A
20
Recursive LCS
lcs_len(i, j) = length of the LCS from the i-th
position onward in String A and from the j-th
position onward in String B
int lcs_len(i, j) {
    if (A[i] == '\0' || B[j] == '\0') return 0;
    else if (A[i] == B[j]) return 1 + lcs_len(i+1, j+1);
    else return max(lcs_len(i+1, j), lcs_len(i, j+1));
}
21
Reduction to Subproblems
int lcs_len(String A, String B) {
    return subproblem(0, 0);
}
int subproblem(int i, int j) {
    if (A[i] == '\0' || B[j] == '\0') return 0;
    else if (A[i] == B[j]) return 1 + subproblem(i+1, j+1);
    else return max(subproblem(i+1, j), subproblem(i, j+1));
}
22
Memorizing the solutions
  • Matrix L[i, j] = -1 // initializing the memory device
  • int subproblem(int i, int j) {
  •   if (L[i, j] < 0) {
  •     if (A[i] == '\0' || B[j] == '\0') L[i, j] = 0;
  •     else if (A[i] == B[j])
  •       L[i, j] = 1 + subproblem(i+1, j+1);
  •     else L[i, j] = max(subproblem(i+1, j),
  •                        subproblem(i, j+1));
  •   }
  •   return L[i, j];
  • }
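
The memoized recursion translates almost line for line into Python. The sketch below is an illustration rather than the slides' own code: it assumes Python strings (so the end-of-string test replaces the '\0' check) and uses `functools.lru_cache` as the "memory device" instead of an explicit matrix initialized to -1.

```python
from functools import lru_cache

def lcs_len(a, b):
    """Length of the longest common subsequence of strings a and b."""

    @lru_cache(maxsize=None)           # caches each (i, j) result after the first call
    def subproblem(i, j):
        # LCS length of the suffixes a[i:] and b[j:]
        if i == len(a) or j == len(b):
            return 0
        if a[i] == b[j]:
            return 1 + subproblem(i + 1, j + 1)
        return max(subproblem(i + 1, j), subproblem(i, j + 1))

    return subproblem(0, 0)

print(lcs_len("nematode_knowledge", "empty_bottle"))
```

Each of the (m+1)(n+1) subproblems is solved once, so the running time drops from exponential to proportional to the table size.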

23
Iterative LCS Table Look-up
To get the length of the LCS of Seq. A and Seq. B:
first allocate storage for the matrix L
for each row i from m down to 0
    for each column j from n down to 0
        if (A[i] == '\0' or B[j] == '\0') L[i, j] = 0
        else if (A[i] == B[j]) L[i, j] = 1 + L[i+1, j+1]
        else L[i, j] = max(L[i+1, j], L[i, j+1])
return L[0, 0]
24
Iterative LCS Table Look-up
  • int lcs_len(String A, String B) { // the length
  •   // First allocate storage for the matrix L
  •   for (i = m; i >= 0; i--) // A has length m+1
  •     for (j = n; j >= 0; j--) { // B has length n+1
  •       if (A[i] == '\0' || B[j] == '\0') L[i, j] = 0;
  •       else if (A[i] == B[j]) L[i, j] = 1 + L[i+1, j+1];
  •       else L[i, j] = max(L[i+1, j], L[i, j+1]);
  •     }
  •   return L[0, 0];
  • }

25
Dynamic Programming Algorithm
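  • L[i, j] = 1 + L[i+1, j+1], if A[i] == B[j]
  • L[i, j] = max(L[i+1, j], L[i, j+1]), otherwise

[Diagram: matrix L, with rows indexed by i (sequence A) and columns by j (sequence B); cell L[i, j] is computed from its neighbours L[i, j+1], L[i+1, j], and L[i+1, j+1]]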
26
       n  e  m  a  t  o  d  e  _  k  n  o  w  l  e  d  g  e
   e   7  7  6  5  5  5  5  5  4  3  3  3  2  2  2  1  1  1  0
   m   6  6  6  5  5  4  4  4  4  3  3  3  2  2  1  1  1  1  0
   p   5  5  5  5  5  4  4  4  4  3  3  3  2  2  1  1  1  1  0
   t   5  5  5  5  5  4  4  4  4  3  3  3  2  2  1  1  1  1  0
   y   4  4  4  4  4  4  4  4  4  3  3  3  2  2  1  1  1  1  0
   _   4  4  4  4  4  4  4  4  4  3  3  3  2  2  1  1  1  1  0
   b   3  3  3  3  3  3  3  3  3  3  3  3  2  2  1  1  1  1  0
   o   3  3  3  3  3  3  3  3  3  3  3  3  2  2  1  1  1  1  0
   t   3  3  3  3  3  2  2  2  2  2  2  2  2  2  1  1  1  1  0
   t   3  3  3  3  3  2  2  2  2  2  2  2  2  2  1  1  1  1  0
   l   2  2  2  2  2  2  2  2  2  2  2  2  2  2  1  1  1  1  0
   e   1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  0
       0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
27
Obtain the subsequence
  • Sequence S = empty // the LCS
  • i = 0; j = 0
  • while (i < m && j < n)
  •   if (A[i] == B[j])
  •     add A[i] to end of S
  •     i++; j++
  •   else
  •     if (L[i+1, j] >= L[i, j+1]) i++
  •     else j++
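
Filling the table and walking it forward can be combined into one small function. The following is a minimal sketch under assumptions: it uses Python lists for the matrix L and string lengths in place of the '\0' terminator, and the name `lcs` is illustrative. When several longest subsequences exist it returns whichever one the tie-breaking rule reaches.

```python
def lcs(a, b):
    """Return one longest common subsequence of a and b."""
    m, n = len(a), len(b)
    # table[i][j] = LCS length of a[i:] and b[j:]; the extra row and column
    # of zeros stand in for the end-of-string case.
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    # Walk from the top-left corner, following the larger neighbour.
    out = []
    i = j = 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return "".join(out)

print(lcs("nematode_knowledge", "empty_bottle"))
```

The walk takes at most m + n steps, so recovering the subsequence costs far less than building the table.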

28
[Diagram: the LCS table drawn as a grid graph, with nematode_knowledge across the top and empty_bottle down the side; nodes are joined by horizontal and vertical edges, and diagonal edges mark positions where the two characters match]
29
Dynamic Programming with scores and penalties
[Diagram: score matrix with a gap of length x and a gap of length y marked relative to column j]
30
Dynamic Programming with scores and penalties
  • S[i, j] = best score from the i-th pos. in A and
    the j-th pos. in B onward
  • S[i, j] = max of
  •   s(A[i], B[j]) + S[i+1, j+1]
  •   max over x of ( S[i+x, j] + w(x) ) // gap x in sequence B
  •   max over y of ( S[i, j+y] + w(y) ) // gap y in sequence A

s: score
w: penalty function (negative-valued, like the indel score -2)
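
The recurrence can be evaluated directly by filling S from the bottom-right corner. This is a minimal sketch, assuming that `s` is any two-argument scoring function and that `w` returns a negative number for a gap of a given length; the names `align_score`, `s`, and `w` as Python callables are illustrative choices, not part of the slides.

```python
def align_score(a, b, s, w):
    """Best global alignment score of a and b under pair score s and gap penalty w."""
    m, n = len(a), len(b)
    NEG = float("-inf")
    # S[i][j] = best score for aligning a[i:] with b[j:]
    S = [[NEG] * (n + 1) for _ in range(m + 1)]
    S[m][n] = 0                               # nothing left to align
    for i in range(m, -1, -1):
        for j in range(n, -1, -1):
            if i == m and j == n:
                continue
            best = NEG
            if i < m and j < n:               # align a[i] with b[j]
                best = s(a[i], b[j]) + S[i + 1][j + 1]
            for x in range(1, m - i + 1):     # a[i:i+x] faces a gap of length x in b
                best = max(best, S[i + x][j] + w(x))
            for y in range(1, n - j + 1):     # b[j:j+y] faces a gap of length y in a
                best = max(best, S[i][j + y] + w(y))
            S[i][j] = best
    return S[0][0]

def simple_score(x, y):                       # +1 match, -1 mismatch
    return 1 if x == y else -1

def linear_gap(k):                            # -2 per gap column
    return -2 * k

print(align_score("ACGT", "AGT", simple_score, linear_gap))
```

Because every cell scans a whole row segment and column segment, this version takes time proportional to m·n·(m+n); restricting the form of w is what makes faster variants possible.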
31
Algorithm for simple gap penalty
  • If for each gap, the penalty is a fixed constant
    c (a negative score, such as -2), then
  • S[i, j] = max of
  •   s(A[i], B[j]) + S[i+1, j+1]
  •   S[i+1, j] + c // one gap
  •   S[i, j+1] + c // one gap
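
With a constant penalty each cell depends on only three neighbours, so the table can be filled in time proportional to m·n and then walked forward to recover an alignment. The sketch below is one possible rendering, assuming the +1 / -1 / -2 scores of the simple scoring rule; the function name and the tie-breaking order (diagonal first) are arbitrary choices.

```python
def global_align(a, b, match=1, mismatch=-1, c=-2):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    m, n = len(a), len(b)
    # S[i][j] = best score for aligning a[i:] with b[j:]
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        S[i][n] = S[i + 1][n] + c             # rest of a faces gaps
    for j in range(n - 1, -1, -1):
        S[m][j] = S[m][j + 1] + c             # rest of b faces gaps
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            pair = match if a[i] == b[j] else mismatch
            S[i][j] = max(pair + S[i + 1][j + 1], S[i + 1][j] + c, S[i][j + 1] + c)

    # Walk forward from (0, 0), re-deriving which choice produced each cell.
    row_a, row_b = [], []
    i = j = 0
    while i < m or j < n:
        inside = i < m and j < n
        pair = (match if a[i] == b[j] else mismatch) if inside else None
        if inside and S[i][j] == pair + S[i + 1][j + 1]:
            row_a.append(a[i]); row_b.append(b[j]); i += 1; j += 1
        elif i < m and S[i][j] == S[i + 1][j] + c:
            row_a.append(a[i]); row_b.append("-"); i += 1
        else:
            row_a.append("-"); row_b.append(b[j]); j += 1
    return S[0][0], "".join(row_a), "".join(row_b)

score, top, bottom = global_align("GCGCATGGATTGAGCGA", "TGCGCCATTGATGACCA")
print(score)
print(top)
print(bottom)
```

Other optimal alignments may exist; changing the order of the three tests in the walk selects a different one with the same score.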

32
Table Tracing
  • To do table tracing based on a similarity matrix
    of amino acids, we redefine S[i, j] to be the
    optimal score given that A[i] is matched with
    B[j].
  • S[i, j] = s(A[i], B[j]) + max of // s: score, w: gap penalty
  •   S[i+1, j+1]
  •   max over x of ( S[i+1+x, j+1] + w(x) ) // gap x in sequence B
  •   max over y of ( S[i+1, j+1+y] + w(y) ) // gap y in sequence A

33
Diagram
[Diagram: matrix S, showing rows i and i+1 and columns j and j+1]
34
Summation operation
  • 1. Start at lower right corner.
  • 2. Move diagonally up one position.
  • 3. Find largest value in either
  •    - row segment starting diagonally below
         current position and extending to the right, or
  •    - column segment starting diagonally
         below current position and extending down.
  • 4. Add this value to the value in the current
    cell.
  • 5. Repeat steps 3 and 4 for all cells to the left
    in current row and all cells above in current
    column.
  • 6. If we are not in the top left corner, go to
    step 2.

50
----V
HGQKV
52
----VA
HGQKVA
53
----VADALTK
HGQKVADALTK
54
----VADALTK
HGQKVADALTK
55
----VADALTKPVNFKFA
HGQKVADALTK------A
56
----VADALTKPVNFKFAVAH
HGQKVADALTK------AVAH
57
Dynamic Programming by Tracing a Similarity Matrix
  • Recall the algorithm Table Tracing
  • Tracing a Similarity Matrix of Amino Acids

58
Use of dynamic programming to evaluate homology
between pairs of sequences
  • If we just want to know the maximum match possible
    between two sequences, then we don't need to do
    trace-back but can just look at the highest value
    in the first row or column (match score). This
    represents the best possible alignment score.

59
Gap penalty alternatives
  • constant gap penalty for gap > 1
  • gap penalty proportional to gap size (affine gap
    penalty)
  •   - one penalty for starting a gap (gap opening
        penalty)
  •   - different (lower) penalty for adding to a gap
        (gap extension penalty)
  • dynamic programming algorithm can be made more
    efficient
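
As a small illustration of the opening/extension idea, an affine penalty can be written as a function of gap length. The numeric values below (-5 to open, -1 to extend) are placeholders chosen for the example, not values from the slides, and the function name is hypothetical.

```python
def affine_gap_penalty(length, gap_open=-5, gap_extend=-1):
    """Penalty for one gap of `length` columns: pay gap_open once for the
    first column, then gap_extend for each additional column."""
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return gap_open + gap_extend * (length - 1)

for k in (1, 2, 4):
    print(k, affine_gap_penalty(k))
```

Under these placeholder values one gap of length 4 costs -8, while four separate gaps of length 1 cost -20, so the scheme favours a few long gaps over many short ones.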

60
Gap penalty alternatives (cont.)
  • gap penalty proportional to gap size and sequence
  • for nucleic acids, can be used to mimic
    thermodynamics of helix formation.
  • two kinds of gap opening penalties
  • one for gap closed by AT, different for GC.
  • different gap extension penalty.

61
End gaps
  • Some programs treat end gaps as normal gaps and
    apply penalties, other programs do not apply
    penalties for end gaps.

62
End gaps (cont.)
  • To determine which behavior a program uses, add
    extra (unmatched) bases to the end of one
    sequence and see whether the match score changes.
  • Penalties for end gaps are appropriate for aligned
    sequences where the ends "should" match.
  • Penalties for end gaps are inappropriate when
    surrounding sequences are expected to be
    different (e.g., conserved exon surrounded by
    varying introns).

63
Global vs. Local Similarity
  • Should the result of the alignment include all amino
    acids or nucleotides, or just those that match?
  • If yes, a global alignment is desired
  • If no, a local alignment is desired
  • Local alignment is accomplished by including
    negative scores for mismatched positions, thus
    scores get worse as we move away from the region of
    match.
  • Instead of starting trace-back with the highest value
    in the first row or column, start with the highest value
    in the entire matrix and stop when the score hits zero.
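
The local variant can be sketched as follows. This is an illustration under assumptions: it uses the prefix-indexed table that is conventional for local alignment (H[i][j] ends at a[i-1] and b[j-1]) rather than the suffix-indexed S used earlier, clamps every cell at zero, and reuses the +1 / -1 / -2 scores; the function name is hypothetical.

```python
def local_align(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for one best local alignment."""
    m, n = len(a), len(b)
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + pair,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:                 # remember the highest cell in the matrix
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest cell and stop when the score hits zero.
    i, j = best_pos
    row_a, row_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        pair = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + pair:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))

print(local_align("XXXACGTYYY", "ZZACGTWW"))
```

The flanking X, Y, Z, and W characters never match, so they are left out of the result and only the shared ACGT core is reported.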