Pairwise Sequence Alignment and Scoring Matrices - PowerPoint PPT Presentation

1 / 46
About This Presentation
Title:

Pairwise Sequence Alignment and Scoring Matrices

Description:

Title: PowerPoint Presentation Author: Xiaole Shirley Liu Last modified by: Jun Liu Created Date: 1/3/2005 7:27:35 PM Document presentation format – PowerPoint PPT presentation

Number of Views:327
Avg rating:3.0/5.0
Slides: 47
Provided by: XiaoleSh
Category:

less

Transcript and Presenter's Notes

Title: Pairwise Sequence Alignment and Scoring Matrices


1
Pairwise Sequence Alignmentand Scoring Matrices
Stat 115 Lecture 3
  • Xiaole Shirley Liu
  • And
  • Jun Liu

2
Mol Bio Quick Facts
  • Building block of DNA is deoxyribonucleic acid
  • Building block of protein is amino acid
  • Protein is a peptide, a long peptide
  • Only exons can code for functional proteins or
    RNAs
  • Introns are spliced out

3
Outline
  • Motivation and introduction
  • Dynamic programming
  • Global sequence alignment
  • Needleman-Wunsch
  • 3 steps Initialize, fill matrix, trace back
  • Gap penalties
  • Local sequence alignment, Smith-Waterman
  • Scoring matrices
  • PAM
  • BLOSUM

4
Pairwise Sequence Alignment
  • Given two sequences, scoring for
    match/mismatch/gap
  • Goal find pairing of letters in the two
    sequences that optimize the total score
  • This is a hard example.
  • That is another easy example.
  • This is a --hard---- example.
  • That is another easy example.

5
Align Biological Sequences
  • DNA (4 nt gap)
  • TTGATCAC
  • TTTA-CAC
  • Protein (20 aa gap)
  • RKVA--GMAKPNM
  • RKIAVAAASKPAV
  • Sometimes gt 4 nt for DNA and gt 20 aa for proteins
  • A word on IUPAC

6
IUPAC for DNA
  • A adenosine
  • C cytidine
  • G guanine
  • T thymidine
  • U uridine
  • R G A (purine)
  • Y T C (pyrimidine)
  • K G T (keto)
  • M A C (amino)
  • S G C (strong)
  • W A T (weak)
  • B C G T (not A)
  • D A G T (not C)
  • H A C T (not G)
  • V A C G (not T)
  • N A C G T (any)
  • gap

7
IUPAC for Protein
  • A Ala
  • B Asp or Asn
  • C Cys
  • D Asp
  • E Glu
  • F Phe
  • G Gly
  • H His
  • I Iso
  • K Lys
  • L Leu
  • M Met
  • N Asn
  • P Pro
  • G Gln
  • R Arg
  • S Ser
  • T Thr
  • U Sel
  • V Val
  • W Try
  • Y Tyr
  • Z Glu or Gln
  • X Any
  • Translation stop
  • Gap

8
Why Align Two Sequences
  • If two sequences are similar, they might share
    the same ancestor
  • If two sequences are similar, they may share the
    same structure, therefore similar function
  • In genome sequencing assembly, if two sequences
    have overlapping similar regions, they might be
    connected to represent longer sequenced region.

9
Scoring Schemes
  • Match, mismatch, gap score determines the final
    alignment
  • This is a --hard---- example.
  • That is another easy example.
  • Match OK, mismatch costly, gap cheap.
  • This is a-- h-ard---- example.
  • Th--at is anothe-r easy example.
  • Match cheap, mismatch cheap, gap costly.
  • This is a hard example.------
  • That is another easy example.

10
Dot Matrix Approach
  • Naïve algorithm
  • Dot match, find diagonal lines
  • Cant afford more complex scoring
  • Visual analysis,
  • hard to find
  • optimal alignments

11
Dynamic Programming
  • Essence of dynamic programming
  • Store the sub-problem solutions for later use
  • Best alignment at (i,j) is the best alignment
    previous to (i,j) plus aligning these two
  • Earliest method, Needleman Wunsch 1969
  • Still the best (sensitive and optimal) algorithm
    for pair-wise alignment

12
Dynamic Programming
  • Best alignment at (i,j) is the best alignment
    previous to (i,j) plus aligning these two

Best previous alignment
i
j
13
Dynamic Programming
  • Best alignment at (i,j) is the best alignment
    previous to (i,j) plus aligning these two
  • Repeat the process until reaching the two
    sequences ends

New best alignment Best previous alignment
align (i,j)
i
j
14
Dynamic Programming Steps
  • Initialize NxM matrix
  • N, M are the length of the two sequences
  • Fill the matrix
  • Each element record the current best score, and
    pointer to the previous best alignment
  • Always search the previous column and row for the
    best previous alignment
  • Trace back to obtain optimal alignment

15
Fill the Matrix
  • BestScorei, j maxBSi-1, j - ? BSi, j-1
    - ? BSi-1,j-1match(i,j)
  • ? A A T G C
  • ? 0 -1 -2 -3 -4 -5
  • A
  • G
  • G
  • C

16
Fill the Matrix
  • BestScorei, j maxBSi-1, j - ? BSi, j-1
    - ? BSi-1,j-1match(i,j)
  • ? A A T G C
  • ? 0 -1 -2 -3 -4 -5
  • A -1 1 0 -1 -2 -3
  • G
  • G
  • C

17
Fill the Matrix
  • BestScorei, j maxBSi-1, j - ? BSi, j-1
    - ? BSi-1,j-1match(i,j)
  • ? A A T G C
  • ? 0 -1 -2 -3 -4 -5
  • A -1 1 0 -1 -2 -3
  • G -2 0 1 0 0 -1
  • G
  • C

18
Fill the Matrix
  • BestScorei, j maxBSi-1, j - ? BSi, j-1
    - ? BSi-1,j-1match(i,j)
  • ? A A T G C
  • ? 0 -1 -2 -3 -4 -5
  • A -1 1 0 -1 -2 -3
  • G -2 0 1 0 0 -1
  • G -3 -1 0 1 1 0
  • C

19
Fill the Matrix
  • BestScorei, j maxBSi-1, j - ? BSi, j-1
    - ? BSi-1,j-1match(i,j)
  • ? A A T G C
  • ? 0 -1 -2 -3 -4 -5
  • A -1 1 0 -1 -2 -3
  • G -2 0 1 0 0 -1
  • G -3 -1 0 1 1 0
  • C -4 -2 -1 0 1 2

20
Fill the Matrix
  • BestScorei, j maxBSi-1, j - ? BSi, j-1
    - ? BSi-1,j-1match(i,j)
  • ? A A T G C
  • ? 0 -1 -2 -3 -4 -5
  • A -1 1 0 -1 -2 -3
  • G -2 0 1 0 0 -1
  • G -3 -1 0 1 1 0
  • C -4 -2 -1 0 1 2

Trace back
AATGC -AGGC AATGC AG-GC
21
Alignment Recursion (linear gap penalty)
C A T T G
0
-1
-2
-3
-4
-5
T C ATG
-1
0
-1
0
1
-2
0
-1
E.g.,
-3
3
2
1
0
5
-4
-1
3
4
5
4
6
-5
-2
2
22
Affine Gap Penalty
  • Gap penalty function
  • Typically agtb (e.g., a12 b2)
  • An order O(nm) algorithm.

Key keep 3 functions, each recording a
directional optimum.
23
Gap Penalty
  • Gap penalty g l ? e
  • A A T G C
  • A 1 1 0 0 0
  • G 0 1 1 1.4 0.3
  • G
  • C

g gap start l gap length e gap extend e.g.
g -0.5 e -0.1
24
Gap Penalty
  • Gap penalty g l ? e
  • A A T G C
  • A 1 1 0 0 0
  • G 0 1 1 1.4 0.3
  • G 0 0.4 1 2 1.4
  • C 0 0.3 0.4 1 3

g gap start l gap length e gap extend e.g.
g -0.5 e -0.1
25
End Gaps
  • Should we penalize gaps at the ends?
  • ATCCGCATACCGGA
  • --CCGCATAC----
  • If two sequences similar length and entire
    sequences are supposed to be similar, penalize.
  • If two sequences have very different length, do
    not penalize (most of the time, ignore end gap
    penalties)

26
Global vs. Local Alignment
  • Global Needleman-Wunsch
  • Find best alignment for the whole 2 sequences
  • Could have no penalty for mismatches/gaps
  • Trace back from lower right corner to upper left
    corner
  • Local Smith-Waterman
  • Find high scoring subsequences
  • E.g. Two proteins only share one similar
    functional domain
  • Can be achieved by modifying global alignment
    method

27
Local Alignment Modifications
  • Use negative mismatch and gap penalties
  • The minimum score for i, j is 0
  • If Si,j lt 0, rewrite it to 0, point to self
  • If previous col/row is all 0, Si,j point to
    self
  • The best score is sought anywhere in the matrix
  • Not just last column and last row (should keep a
    global pointer to the best score)
  • Trace back until a cell pointing to itself (not
    necessary to the beginning of the two sequences)

28
Global vs Local
  • Needleman-Wunsch
  • Smith-Waterman

29
Matrix Filling in Smith-Waterman
max k lt j S(i,j-k)
g(k) S(i,j) max
S(i-1,j-1) m(i,j) max l
lt i S(i-l,j) g(l) 0
S(i,j-2)
g(2)
S(i-1,j-1)
S(i,j)
S(i-3,j)
g(3)
30
Example
31
Finding suboptimal alignments
32
Smith-Waterman
  • Negative mis-match negative gaps
  • Scoring matrix gt 0
  • Trace from maximum

33
Scoring Matrices
  • For DNA, match 5, mismatch 4
  • For proteins, different amino acid pairs receive
    different scores
  • Consideration size, shape, electric charge, van
    der Waals interaction, ability to form salt,
    hydrophobic, and hydrogen bonds
  • Substitution matrices
  • Often symmetrical
  • / scores
  • functional similarity

34
PAM Matrices
  • MO Dayhoff 1978
  • PAM percent accepted mutations
  • Database of 1572 changes in 71 groups of closely
    related proteins (gt 85 similar)
  • Construct phylogenetic tree of each group,
    tabulate probability of amino acid changes
    between every pairs of aa
  • For statistician Markov chain transition b/w aa
    pairs

35
PAM Matrix Family
  • PAM-N
  • PAM-0 1 on diagonal and 0 all the rest
  • PAM-N what would happen if N out of 100 aa
    mutate
  • For statisticians matrix multiplication N times
  • Bigger N, more diverged substitution matrice
  • Final matrix

36
PAM250
  • 250 substitutions in 100 residues, only 1/5
    residue remain unchanged

37
BLOSUM Matrices
  • BLOcks amino acid SUbstitution Matrices
  • Henikoff and Henikoff, PNAS. 1992, 8910915-9
  • Check gt500 protein families in the Prosite
    database (Bairoch 1991)
  • Find 2000 blocks of aligned segments
  • BLOCKS database
  • Characterize ungapped patterns 3-60 aa long
  • Check aligned columns for observed substitutions

38
BLOSUM Matrix Entry
  • How frequently do aa appear
  • How often do we expect to see i, j together
  • How often do we actually see them together in all
    the alignments
  • BLOSUM entry

39
BLOSUM62 Matrix
40
BLOSUM Matrices
  • Blocks are grouped before looking at aa
    substitutions
  • BLOSUMN if sequences gt N identical, their
    contributions are weighted to sum to 1
  • Most widely used BLOSUM62 and PAM250

41
More About Dynamic Programming
  • Example 1 Suppose I have x0 amount of savings at
    retirement, and also receive st amount of social
    security payment every year. Annual interest rate
    is qt , what is an optimal spending plan if I
    want to leave zero dollars at the end (say, year
    5)?

Maximizing total spending?
Year 0 Year 1 Year 2 Year 3 Year 4
x0 x1 x2 x3 x4
s0 s1 s2 s3 s4
u0 u1 u2 u3 u4

42
Example 2 Secretary Problem
  • We get to observe the qualities of m
    secretaries X1,, Xm sequentially according to a
    random order. Our goal is to maximize the
    probability of finding the best candidate with
    no looking back!
  • Heuristic We start our reasoning backwards.
    Suppose we have seen X1,, Xj, should we stop or
    go on?

43
Lets start reasoning ...
  • What if we wait till the last person?
  • What if we wait till having two people left?
  • Strategy if m-1st person is better than previous
    ones, take her otherwise wait till the last one.
  • Get a recursion? If we let go j-1 people, and
    take the best-person-so-far starting from jth
    person ...

Pj maximized at
44
Final answer
  • We should reject the first 37 of the candidates
    and start to look recruit the first person who
    is the best among all that have been interviewed.
  • The chance of getting the best one is 37 !

45
Summary
  • Dynamic programming finds optimal alignment
    between two sequences
  • Keep subproblem solution for later use
  • Needleman Wunsch, global
  • Smith Waterman, local
  • Scoring sensitive to gap penalty and substitution
    matrices used
  • Substitution matrices capture aa similarity
  • PAM and BLOSUM matrices

46
Question for Thoughts
  • Given a string of integers (both positive and
    negative)
  • E.g. 3, -1, -5, 2, 4, -3, 6, -4, 2, 5, -8, 3, 1,
    -7, 6
  • Can you read each number only ONCE, and tell from
    which number to which number you will get the
    largest sum?
  • 3, -1, -5, 2, 4, -3, 6, -4, 2, 5, -8, 3, 1, -7,
    6, -2
  • Largest sum 12
  • Hint dynamic programming

47
Acknowledgement
  • Russ Altman
  • Theodor Hanekamp
  • Eric Rouchka
Write a Comment
User Comments (0)
About PowerShow.com