Title: Pairwise Sequence Alignment and Scoring Matrices
1Pairwise Sequence Alignmentand Scoring Matrices
Stat 115 Lecture 3
- Xiaole Shirley Liu
- And
- Jun Liu
2Mol Bio Quick Facts
- Building block of DNA is deoxyribonucleic acid
- Building block of protein is amino acid
- Protein is a peptide, a long peptide
- Only exons can code for functional proteins or
RNAs - Introns are spliced out
3Outline
- Motivation and introduction
- Dynamic programming
- Global sequence alignment
- Needleman-Wunsch
- 3 steps Initialize, fill matrix, trace back
- Gap penalties
- Local sequence alignment, Smith-Waterman
- Scoring matrices
- PAM
- BLOSUM
4Pairwise Sequence Alignment
- Given two sequences, scoring for
match/mismatch/gap - Goal find pairing of letters in the two
sequences that optimize the total score - This is a hard example.
- That is another easy example.
- This is a --hard---- example.
-
- That is another easy example.
5Align Biological Sequences
- DNA (4 nt gap)
- TTGATCAC
- TTTA-CAC
- Protein (20 aa gap)
- RKVA--GMAKPNM
- RKIAVAAASKPAV
- Sometimes gt 4 nt for DNA and gt 20 aa for proteins
- A word on IUPAC
6IUPAC for DNA
- A adenosine
- C cytidine
- G guanine
- T thymidine
- U uridine
- R G A (purine)
- Y T C (pyrimidine)
- K G T (keto)
- M A C (amino)
- S G C (strong)
- W A T (weak)
- B C G T (not A)
- D A G T (not C)
- H A C T (not G)
- V A C G (not T)
- N A C G T (any)
- gap
7IUPAC for Protein
- A Ala
- B Asp or Asn
- C Cys
- D Asp
- E Glu
- F Phe
- G Gly
- H His
- I Iso
- K Lys
- L Leu
- M Met
- N Asn
- P Pro
- G Gln
- R Arg
- S Ser
- T Thr
- U Sel
- V Val
- W Try
- Y Tyr
- Z Glu or Gln
- X Any
- Translation stop
- Gap
8Why Align Two Sequences
- If two sequences are similar, they might share
the same ancestor - If two sequences are similar, they may share the
same structure, therefore similar function - In genome sequencing assembly, if two sequences
have overlapping similar regions, they might be
connected to represent longer sequenced region.
9Scoring Schemes
- Match, mismatch, gap score determines the final
alignment - This is a --hard---- example.
-
- That is another easy example.
- Match OK, mismatch costly, gap cheap.
- This is a-- h-ard---- example.
- Th--at is anothe-r easy example.
- Match cheap, mismatch cheap, gap costly.
- This is a hard example.------
- That is another easy example.
10Dot Matrix Approach
- Naïve algorithm
- Dot match, find diagonal lines
- Cant afford more complex scoring
- Visual analysis,
- hard to find
- optimal alignments
11Dynamic Programming
- Essence of dynamic programming
- Store the sub-problem solutions for later use
- Best alignment at (i,j) is the best alignment
previous to (i,j) plus aligning these two - Earliest method, Needleman Wunsch 1969
- Still the best (sensitive and optimal) algorithm
for pair-wise alignment
12Dynamic Programming
- Best alignment at (i,j) is the best alignment
previous to (i,j) plus aligning these two
Best previous alignment
i
j
13Dynamic Programming
- Best alignment at (i,j) is the best alignment
previous to (i,j) plus aligning these two - Repeat the process until reaching the two
sequences ends
New best alignment Best previous alignment
align (i,j)
i
j
14Dynamic Programming Steps
- Initialize NxM matrix
- N, M are the length of the two sequences
- Fill the matrix
- Each element record the current best score, and
pointer to the previous best alignment - Always search the previous column and row for the
best previous alignment - Trace back to obtain optimal alignment
15Fill the Matrix
- BestScorei, j maxBSi-1, j - ? BSi, j-1
- ? BSi-1,j-1match(i,j) - ? A A T G C
- ? 0 -1 -2 -3 -4 -5
- A
- G
- G
- C
16Fill the Matrix
- BestScorei, j maxBSi-1, j - ? BSi, j-1
- ? BSi-1,j-1match(i,j) - ? A A T G C
- ? 0 -1 -2 -3 -4 -5
- A -1 1 0 -1 -2 -3
- G
- G
- C
17Fill the Matrix
- BestScorei, j maxBSi-1, j - ? BSi, j-1
- ? BSi-1,j-1match(i,j) - ? A A T G C
- ? 0 -1 -2 -3 -4 -5
- A -1 1 0 -1 -2 -3
- G -2 0 1 0 0 -1
- G
- C
18Fill the Matrix
- BestScorei, j maxBSi-1, j - ? BSi, j-1
- ? BSi-1,j-1match(i,j) - ? A A T G C
- ? 0 -1 -2 -3 -4 -5
- A -1 1 0 -1 -2 -3
- G -2 0 1 0 0 -1
- G -3 -1 0 1 1 0
- C
19Fill the Matrix
- BestScorei, j maxBSi-1, j - ? BSi, j-1
- ? BSi-1,j-1match(i,j) - ? A A T G C
- ? 0 -1 -2 -3 -4 -5
- A -1 1 0 -1 -2 -3
- G -2 0 1 0 0 -1
- G -3 -1 0 1 1 0
- C -4 -2 -1 0 1 2
20Fill the Matrix
- BestScorei, j maxBSi-1, j - ? BSi, j-1
- ? BSi-1,j-1match(i,j) - ? A A T G C
- ? 0 -1 -2 -3 -4 -5
- A -1 1 0 -1 -2 -3
- G -2 0 1 0 0 -1
- G -3 -1 0 1 1 0
- C -4 -2 -1 0 1 2
Trace back
AATGC -AGGC AATGC AG-GC
21Alignment Recursion (linear gap penalty)
C A T T G
0
-1
-2
-3
-4
-5
T C ATG
-1
0
-1
0
1
-2
0
-1
E.g.,
-3
3
2
1
0
5
-4
-1
3
4
5
4
6
-5
-2
2
22Affine Gap Penalty
- Gap penalty function
- Typically agtb (e.g., a12 b2)
- An order O(nm) algorithm.
Key keep 3 functions, each recording a
directional optimum.
23Gap Penalty
- Gap penalty g l ? e
- A A T G C
- A 1 1 0 0 0
- G 0 1 1 1.4 0.3
- G
- C
g gap start l gap length e gap extend e.g.
g -0.5 e -0.1
24Gap Penalty
- Gap penalty g l ? e
- A A T G C
- A 1 1 0 0 0
- G 0 1 1 1.4 0.3
- G 0 0.4 1 2 1.4
- C 0 0.3 0.4 1 3
g gap start l gap length e gap extend e.g.
g -0.5 e -0.1
25End Gaps
- Should we penalize gaps at the ends?
- ATCCGCATACCGGA
- --CCGCATAC----
- If two sequences similar length and entire
sequences are supposed to be similar, penalize. - If two sequences have very different length, do
not penalize (most of the time, ignore end gap
penalties)
26Global vs. Local Alignment
- Global Needleman-Wunsch
- Find best alignment for the whole 2 sequences
- Could have no penalty for mismatches/gaps
- Trace back from lower right corner to upper left
corner - Local Smith-Waterman
- Find high scoring subsequences
- E.g. Two proteins only share one similar
functional domain - Can be achieved by modifying global alignment
method
27Local Alignment Modifications
- Use negative mismatch and gap penalties
- The minimum score for i, j is 0
- If Si,j lt 0, rewrite it to 0, point to self
- If previous col/row is all 0, Si,j point to
self - The best score is sought anywhere in the matrix
- Not just last column and last row (should keep a
global pointer to the best score) - Trace back until a cell pointing to itself (not
necessary to the beginning of the two sequences)
28Global vs Local
29Matrix Filling in Smith-Waterman
max k lt j S(i,j-k)
g(k) S(i,j) max
S(i-1,j-1) m(i,j) max l
lt i S(i-l,j) g(l) 0
S(i,j-2)
g(2)
S(i-1,j-1)
S(i,j)
S(i-3,j)
g(3)
30Example
31Finding suboptimal alignments
32Smith-Waterman
- Negative mis-match negative gaps
- Scoring matrix gt 0
- Trace from maximum
33Scoring Matrices
- For DNA, match 5, mismatch 4
- For proteins, different amino acid pairs receive
different scores - Consideration size, shape, electric charge, van
der Waals interaction, ability to form salt,
hydrophobic, and hydrogen bonds - Substitution matrices
- Often symmetrical
- / scores
- functional similarity
34PAM Matrices
- MO Dayhoff 1978
- PAM percent accepted mutations
- Database of 1572 changes in 71 groups of closely
related proteins (gt 85 similar) - Construct phylogenetic tree of each group,
tabulate probability of amino acid changes
between every pairs of aa - For statistician Markov chain transition b/w aa
pairs
35PAM Matrix Family
- PAM-N
- PAM-0 1 on diagonal and 0 all the rest
- PAM-N what would happen if N out of 100 aa
mutate - For statisticians matrix multiplication N times
- Bigger N, more diverged substitution matrice
- Final matrix
36PAM250
- 250 substitutions in 100 residues, only 1/5
residue remain unchanged
37BLOSUM Matrices
- BLOcks amino acid SUbstitution Matrices
- Henikoff and Henikoff, PNAS. 1992, 8910915-9
- Check gt500 protein families in the Prosite
database (Bairoch 1991) - Find 2000 blocks of aligned segments
- BLOCKS database
- Characterize ungapped patterns 3-60 aa long
- Check aligned columns for observed substitutions
38BLOSUM Matrix Entry
- How frequently do aa appear
- How often do we expect to see i, j together
- How often do we actually see them together in all
the alignments - BLOSUM entry
39BLOSUM62 Matrix
40BLOSUM Matrices
- Blocks are grouped before looking at aa
substitutions - BLOSUMN if sequences gt N identical, their
contributions are weighted to sum to 1 - Most widely used BLOSUM62 and PAM250
41More About Dynamic Programming
- Example 1 Suppose I have x0 amount of savings at
retirement, and also receive st amount of social
security payment every year. Annual interest rate
is qt , what is an optimal spending plan if I
want to leave zero dollars at the end (say, year
5)?
Maximizing total spending?
Year 0 Year 1 Year 2 Year 3 Year 4
x0 x1 x2 x3 x4
s0 s1 s2 s3 s4
u0 u1 u2 u3 u4
42Example 2 Secretary Problem
- We get to observe the qualities of m
secretaries X1,, Xm sequentially according to a
random order. Our goal is to maximize the
probability of finding the best candidate with
no looking back! - Heuristic We start our reasoning backwards.
Suppose we have seen X1,, Xj, should we stop or
go on?
43Lets start reasoning ...
- What if we wait till the last person?
- What if we wait till having two people left?
- Strategy if m-1st person is better than previous
ones, take her otherwise wait till the last one. - Get a recursion? If we let go j-1 people, and
take the best-person-so-far starting from jth
person ...
Pj maximized at
44Final answer
- We should reject the first 37 of the candidates
and start to look recruit the first person who
is the best among all that have been interviewed.
- The chance of getting the best one is 37 !
45Summary
- Dynamic programming finds optimal alignment
between two sequences - Keep subproblem solution for later use
- Needleman Wunsch, global
- Smith Waterman, local
- Scoring sensitive to gap penalty and substitution
matrices used - Substitution matrices capture aa similarity
- PAM and BLOSUM matrices
46Question for Thoughts
- Given a string of integers (both positive and
negative) - E.g. 3, -1, -5, 2, 4, -3, 6, -4, 2, 5, -8, 3, 1,
-7, 6 - Can you read each number only ONCE, and tell from
which number to which number you will get the
largest sum? - 3, -1, -5, 2, 4, -3, 6, -4, 2, 5, -8, 3, 1, -7,
6, -2 - Largest sum 12
- Hint dynamic programming
47Acknowledgement
- Russ Altman
- Theodor Hanekamp
- Eric Rouchka