Title: Sequence Similarity
1Sequence Similarity
2Why sequence similarity
- structural similarity
- > 25% sequence identity ⇒ similar structure
- evolutionary relationship
- all proteins come from < 2000 (super)families
- related functional role
- similar structure ⇒ similar function
- functional modules are often preserved
3Muscle cells and contraction
4Actin and myosin during muscle movement
5Actin structure
6Actin sequence
- Actin is ancient and abundant
- Most abundant protein in cells
- 1-2 actin genes in bacteria, yeasts, amoebas
- Humans: 6 actin genes
- α-actin in muscles; β-actin and γ-actin in non-muscle cells
- 4 amino acids different between each version
- MUSCLE ACTIN Amino Acid Sequence
  1 EEEQTALVCD NGSGLVKAGF AGDDAPRAVF PSIVRPRHQG VMVGMGQKDS YVGDEAQSKR
 61 GILTLKYPIE HGIITNWDDM EKIWHHTFYN ELRVAPEEHP VLLTEAPLNP KANREKMTQI
121 MFETFNVPAM YVAIQAVLSL YASGRTTGIV LDSGDGVSHN VPIYEGYALP HAIMRLDLAG
181 RDLTDYLMKI LTERGYSFVT TAEREIVRDI KEKLCYVALD FEQEMATAAS SSSLEKSYEL
241 PDGQVITIGN ERFRGPETMF QPSFIGMESS GVHETTYNSI MKCDIDIRKD LYANNVLSGG
301 TTMYPGIADR MQKEITALAP STMKIKIIAP PERKYSVWIG GSILASLSTF QQMWITKQEY
361 DESGPSIVHR KCF
7A related protein in bacteria
8Relation between sequence and structure
9A multiple alignment of actins
10Gene expression
DNA: CCTGAGCCAACTATTGATGAA
RNA (transcription): CCUGAGCCAACUAUUGAUGAA
Protein (translation): PEPTIDE
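The DNA → RNA → protein example above can be reproduced with a short Python sketch. The function names `transcribe` and `translate` are illustrative, and the codon table is the standard genetic code written in the usual U, C, A, G order; reading starts at the first base and stops at a stop codon.

```python
from itertools import product

# Standard genetic code, codons enumerated in U, C, A, G order
# (first base varies slowest). '*' marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def transcribe(dna):
    """Coding-strand DNA to mRNA: replace T with U."""
    return dna.upper().replace("T", "U")

def translate(rna):
    """Translate mRNA codon by codon until a stop codon or the end."""
    peptide = []
    for k in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[k:k + 3]]
        if amino_acid == "*":
            break
        peptide.append(amino_acid)
    return "".join(peptide)

rna = transcribe("CCTGAGCCAACTATTGATGAA")
print(rna)             # CCUGAGCCAACUAUUGAUGAA
print(translate(rna))  # PEPTIDE
```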
11Biomolecules as Strings
- Macromolecules are the chemical building blocks of cells
- Proteins: 20 amino acids
- Nucleic acids: 4 nucleotides (A, C, G, T)
- Polysaccharides
12The information is in the sequence
- Sequence → Structure → Function
- Sequence similarity ⇒ structural and/or functional similarity
- Nucleic acids and proteins are related by molecular evolution
- Orthologs: two proteins in animals X and Y that evolved from one protein in their immediate ancestor animal Z
- Paralogs: two proteins that evolved from one protein through duplication in some ancestor
- Homologs: orthologs or paralogs that exhibit sequence similarity
13Protein Phylogenies
- Proteins evolve by both duplication and species divergence
[Figure: protein phylogeny marking a duplication event, orthologs, and paralogs]
14Evolution
15Evolution at the DNA level
SEQUENCE EDITS
- Mutation
- Deletion
ACGGTGCAGTTACCA
AC----CAGTCACCA
REARRANGEMENTS
- Inversion
- Translocation
- Duplication
16Evolutionary Rates
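[Figure: a sequence copied to the next generation; changes at non-functional sites are marked OK, changes at functional sites are marked X or "Still OK?"]
- Changes in non-functional sites are OK, so will be propagated
- Most changes in functional sites are deleterious and will be rejected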
17Sequence conservation implies function
Human and rodent proteins are on average 85% identical
18Sequence Alignment
Unaligned:
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
Aligned:
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Definition: Given two strings $x = x_1 x_2 \ldots x_M$ and $y = y_1 y_2 \ldots y_N$, an alignment is an assignment of gaps to positions $0, \ldots, M$ in $x$ and $0, \ldots, N$ in $y$, so as to line up each letter in one sequence with either a letter or a gap in the other sequence
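The definition can be checked mechanically. The sketch below is one possible reading of it: two gapped rows form an alignment of x and y if they have equal length, spell x and y once gaps are removed, and never pair a gap with a gap. The function name `is_valid_alignment` and the "-" gap symbol are assumptions made for illustration.

```python
def is_valid_alignment(row_x, row_y, x, y, gap="-"):
    """Check that (row_x, row_y) is a gapped alignment of x and y."""
    same_length = len(row_x) == len(row_y)
    spells_x = row_x.replace(gap, "") == x
    spells_y = row_y.replace(gap, "") == y
    # Every column must contain at least one letter.
    no_double_gap = all(
        not (a == gap and b == gap) for a, b in zip(row_x, row_y)
    )
    return same_length and spells_x and spells_y and no_double_gap

x = "AGGCTATCACCTGACCTCCAGGCCGATGCCC"
y = "TAGCTATCACGACCGCGGTCGATTTGCCCGAC"
row_x = "-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---"
row_y = "TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC"
print(is_valid_alignment(row_x, row_y, x, y))  # True
```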
19What is a good alignment?
- Alignment: the best way to match the letters of one sequence with those of the other
- How do we define best?
- Alignment: a hypothesis that the two sequences come from a common ancestor through sequence edits
- Parsimonious explanation: find the minimum number of edits that transform one sequence into the other
20Scoring Function
- Sequence edits, starting from AGGCCTC
- Mutations: AGGACTC
- Insertions: AGGGCCTC
- Deletions: AGGCTC
- Scoring Function
- Match: m
- Mismatch: s
- Gap: d
- Score $F = (\#\,\text{matches}) \times m - (\#\,\text{mismatches}) \times s - (\#\,\text{gaps}) \times d$
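The scoring function can be sketched in Python as follows. It assumes m, s, and d are non-negative magnitudes, so mismatches and gaps are subtracted from the score; the name `score_alignment` and the default values are illustrative only.

```python
def score_alignment(row_x, row_y, m=1, s=1, d=1, gap="-"):
    """Score two equal-length gapped rows:
    F = (#matches) * m - (#mismatches) * s - (#gaps) * d
    """
    matches = mismatches = gaps = 0
    for a, b in zip(row_x, row_y):
        if a == gap or b == gap:
            gaps += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches * m - mismatches * s - gaps * d

print(score_alignment("AGGCCTC", "AGGACTC"))  # one mutation: 6 - 1 = 5
print(score_alignment("AGGCCTC", "AGGC-TC"))  # one deletion: 6 - 1 = 5
```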
21How do we compute the best alignment?
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC
Too many possible alignments: $O(2^{M+N})$
22Alignment is additive
- Observation: the score of aligning $x_1 \ldots x_M$ with $y_1 \ldots y_N$ is additive
- Say that $x_1 \ldots x_i$ aligns to $y_1 \ldots y_j$, and $x_{i+1} \ldots x_M$ aligns to $y_{j+1} \ldots y_N$
- The two scores add up: $F(x[1:M], y[1:N]) = F(x[1:i], y[1:j]) + F(x[i+1:M], y[j+1:N])$
- Key property: the optimal solution to the entire problem is composed of optimal solutions to subproblems ⇒ Dynamic Programming
23Dynamic Programming
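- Construct a DP matrix $F$ of size $M \times N$ (plus a row and a column 0 for the empty prefixes)
- Suppose we wish to align $x_1 \ldots x_M$ with $y_1 \ldots y_N$
- Let $F(i, j)$ = optimal score of aligning $x_1 \ldots x_i$ with $y_1 \ldots y_j$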
24Dynamic Programming (cont'd)
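- Notice three possible cases
- 1. $x_i$ aligns to $y_j$
  $x_1 \ldots x_{i-1} \;\; x_i$
  $y_1 \ldots y_{j-1} \;\; y_j$
  $F(i, j) = F(i-1, j-1) + m$ if $x_i = y_j$, and $F(i, j) = F(i-1, j-1) - s$ if not
- 2. $x_i$ aligns to a gap
  $x_1 \ldots x_{i-1} \;\; x_i$
  $y_1 \ldots y_j \;\; -$
  $F(i, j) = F(i-1, j) - d$
- 3. $y_j$ aligns to a gap
  $x_1 \ldots x_i \;\; -$
  $y_1 \ldots y_{j-1} \;\; y_j$
  $F(i, j) = F(i, j-1) - d$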
25Dynamic Programming (cont'd)
- How do we know which case is correct?
- Inductive assumption: $F(i, j-1)$, $F(i-1, j)$, $F(i-1, j-1)$ are optimal
- Then, $F(i, j) = \max\{\, F(i-1, j-1) + s(x_i, y_j),\; F(i-1, j) - d,\; F(i, j-1) - d \,\}$
- Where $s(x_i, y_j) = m$ if $x_i = y_j$, and $-s$ if not
26Example
- x = AGTA, y = ATA
- m = 1, s = 1, d = 1 (a match adds 1; a mismatch or a gap subtracts 1)
F(i, j)  i =      0    1    2    3    4
                       A    G    T    A
j = 0             0   -1   -2   -3   -4
j = 1  A         -1    1    0   -1   -2
j = 2  T         -2    0    0    1    0
j = 3  A         -3   -1   -1    0    2
Optimal alignment: F(4, 3) = 2
AGTA
A-TA
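The example table can be recomputed directly from the recurrence. The sketch below is a top-down, memoized version written for illustration; it treats s and d as positive penalties (m = 1, s = 1, d = 1), and the helper name `make_score_function` is hypothetical.

```python
from functools import lru_cache

def make_score_function(x, y, m=1, s=1, d=1):
    """Return F(i, j): optimal score of aligning x[:i] with y[:j]."""
    @lru_cache(maxsize=None)
    def F(i, j):
        if i == 0:                 # empty prefix of x: j gaps
            return -j * d
        if j == 0:                 # empty prefix of y: i gaps
            return -i * d
        substitution = m if x[i - 1] == y[j - 1] else -s
        return max(
            F(i - 1, j - 1) + substitution,  # x_i aligns to y_j
            F(i - 1, j) - d,                 # x_i aligns to a gap
            F(i, j - 1) - d,                 # y_j aligns to a gap
        )
    return F

F = make_score_function("AGTA", "ATA")
# Rows are j = 0..3 (letters of y), columns are i = 0..4 (letters of x).
table = [[F(i, j) for i in range(5)] for j in range(4)]
for row in table:
    print(row)
```

Each printed row corresponds to one row of the table in the example, and F(4, 3) is the optimal score.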
27The Needleman-Wunsch Algorithm
- Initialization.
  - $F(0, 0) = 0$
  - $F(0, j) = -j \times d$
  - $F(i, 0) = -i \times d$
- Main Iteration. Filling in partial alignments
  - For each $i = 1 \ldots M$
    - For each $j = 1 \ldots N$
      - $F(i, j) = \max\{\, F(i-1, j-1) + s(x_i, y_j) \text{ [case 1]},\; F(i-1, j) - d \text{ [case 2]},\; F(i, j-1) - d \text{ [case 3]} \,\}$
      - $Ptr(i, j)$ = DIAG if case 1, LEFT if case 2, UP if case 3
- Termination. $F(M, N)$ is the optimal score, and from $Ptr(M, N)$ we can trace back the optimal alignment
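A minimal Python sketch of these three steps follows. It assumes the simple match/mismatch score (m for a match, -s for a mismatch) and a linear gap penalty d; ties between cases are broken in favor of DIAG, then LEFT, then UP, which is an arbitrary choice made here.

```python
def needleman_wunsch(x, y, m=1, s=1, d=1):
    """Global alignment; returns (optimal score, gapped x, gapped y)."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    ptr = [[None] * (N + 1) for _ in range(M + 1)]

    # Initialization: prefixes aligned entirely against gaps.
    for i in range(1, M + 1):
        F[i][0], ptr[i][0] = -i * d, "LEFT"
    for j in range(1, N + 1):
        F[0][j], ptr[0][j] = -j * d, "UP"

    # Main iteration: fill in partial alignments.
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            candidates = [
                (F[i - 1][j - 1] + sub, "DIAG"),  # case 1
                (F[i - 1][j] - d, "LEFT"),        # case 2: x_i vs gap
                (F[i][j - 1] - d, "UP"),          # case 3: y_j vs gap
            ]
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])

    # Termination: trace back from (M, N) to (0, 0).
    row_x, row_y = [], []
    i, j = M, N
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "DIAG":
            row_x.append(x[i - 1]); row_y.append(y[j - 1])
            i, j = i - 1, j - 1
        elif move == "LEFT":
            row_x.append(x[i - 1]); row_y.append("-")
            i -= 1
        else:
            row_x.append("-"); row_y.append(y[j - 1])
            j -= 1
    return F[M][N], "".join(reversed(row_x)), "".join(reversed(row_y))

print(needleman_wunsch("AGTA", "ATA"))  # (2, 'AGTA', 'A-TA')
```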
28Performance
- Time: O(NM)
- Space: O(NM)
- Possible to reduce space to O(N + M) using Hirschberg's divide & conquer algorithm
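The sketch below is not Hirschberg's algorithm itself, only the building block it relies on: because each row of F depends only on the previous row, the optimal score can be computed while keeping two rows in memory. Recovering the alignment in linear space needs the divide and conquer step, which is omitted here. The function name is illustrative.

```python
def nw_score_linear_space(x, y, m=1, s=1, d=1):
    """Optimal global alignment score using O(min(M, N)) memory."""
    if len(y) > len(x):
        x, y = y, x  # the score is symmetric, so keep the shorter row
    prev = [-j * d for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        curr = [-i * d] + [0] * len(y)
        for j in range(1, len(y) + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            curr[j] = max(prev[j - 1] + sub, prev[j] - d, curr[j - 1] - d)
        prev = curr
    return prev[len(y)]

print(nw_score_linear_space("AGTA", "ATA"))  # 2
```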
29Substitutions of Amino Acids
- Mutation rates between amino acids have dramatic
differences!
How can we quantify the differences in rates by
which one amino acid replaces another across
related proteins?
30Substitution Matrices
- BLOSUM matrices
- Start from the BLOCKS database (curated, gap-free alignments)
- Cluster sequences according to > X% identity
- Calculate $A_{ab}$: the number of aligned a-b pairs in distinct clusters, correcting by $1/mn$, where $m$, $n$ are the two cluster sizes
- Estimate $P(a) = \left(\sum_b A_{ab}\right) / \left(\sum_{c,d} A_{cd}\right)$ and $P(a, b) = A_{ab} / \left(\sum_{c,d} A_{cd}\right)$
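The two estimates can be computed from a table of pair counts as sketched below. The counts are made-up toy numbers over a two-letter alphabet, not BLOCKS data, and the table is assumed symmetric. The final log-odds step, $\log_2 \frac{P(a, b)}{P(a) P(b)}$, is not stated on the slide; it is included as the usual way such probabilities are turned into substitution scores.

```python
import math
from collections import defaultdict

def estimate_probabilities(counts):
    """counts maps ordered pairs (a, b) to A_ab, with A_ab == A_ba."""
    total = sum(counts.values())
    p_pair = {pair: c / total for pair, c in counts.items()}
    p_single = defaultdict(float)
    for (a, _b), c in counts.items():
        p_single[a] += c / total   # P(a) = sum_b A_ab / sum_cd A_cd
    return dict(p_single), p_pair

def log_odds(p_single, p_pair, a, b):
    """Assumed scoring step: log2 of observed vs. expected pair frequency."""
    return math.log2(p_pair[(a, b)] / (p_single[a] * p_single[b]))

# Hypothetical toy counts for an alphabet {A, B}.
toy_counts = {("A", "A"): 6, ("A", "B"): 1, ("B", "A"): 1, ("B", "B"): 2}
p_single, p_pair = estimate_probabilities(toy_counts)
print(p_single)
print(log_odds(p_single, p_pair, "A", "A") > 0)  # identical pair is favored
print(log_odds(p_single, p_pair, "A", "B") < 0)  # substitution is penalized
```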
31Gaps are not inserted uniformly
32A state model for alignment
- States and the number of letters each consumes: M (+1, +1), I (+1, 0), J (0, +1)
- Alignments correspond 1-to-1 with sequences of states M, I, J
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC
IMMJMMMMMMMJJMMMMMMJMMMMMMMIIMMMMMIII
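The correspondence between an alignment and its state sequence can be sketched as below. The labeling follows the picture on this slide, where I marks a column with a gap in the top row and J a column with a gap in the bottom row; the function name `state_path` is illustrative.

```python
def state_path(top, bottom, gap="-"):
    """Map each alignment column to a state letter M, I, or J."""
    states = []
    for a, b in zip(top, bottom):
        if a == gap:
            states.append("I")   # gap in the top row
        elif b == gap:
            states.append("J")   # gap in the bottom row
        else:
            states.append("M")   # letter aligned to letter
    return "".join(states)

top = "-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---"
bottom = "TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC"
print(state_path(top, bottom))
```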
33Let's score the transitions
- Entering M (from M, I, or J): $s(x_i, y_j)$
- M → I and M → J (opening a gap): $-d$
- I → I and J → J (extending a gap): $-e$
- Alignments correspond 1-to-1 with sequences of states M, I, J
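With these transition scores, a gap of length k costs d + (k - 1) × e. The sketch below scores a given alignment that way. It uses m and -s as a stand-in for $s(x_i, y_j)$, and it charges a fresh -d when a gap in one row is followed directly by a gap in the other; both choices are assumptions, as are the default values.

```python
def affine_gap_score(row_x, row_y, m=1, s=1, d=3, e=1, gap="-"):
    """Score an alignment with gap-open penalty d and gap-extend penalty e."""
    score = 0
    previous = "match"
    for a, b in zip(row_x, row_y):
        if a == gap:
            current = "gap_in_x"
        elif b == gap:
            current = "gap_in_y"
        else:
            current = "match"
        if current == "match":
            score += m if a == b else -s   # substitution score
        elif current == previous:
            score -= e                     # extending the same gap
        else:
            score -= d                     # opening a new gap
        previous = current
    return score

print(affine_gap_score("AGGCTA", "A--CTA"))  # 4 matches - 3 - 1 = 0
```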
34A probabilistic model for alignment
- Assign probabilities to every transition (arrow), and emission (pair of letters or gaps)
- Probabilities of mutation reflect amino acid similarities
- Different probabilities for opening and extending a gap
35A Pair HMM for alignments
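- State M emits a pair with probability $P(x_i, y_j)$, scored as $\log \mathrm{Prob}(x_i, y_j)$
- State I emits $x_i$ with probability $P(x_i)$; state J emits $y_j$ with probability $P(y_j)$
- M → M: $\log(1 - 2\delta)$
- I → M and J → M: $\log(1 - \varepsilon)$
- M → I and M → J: $\log \delta$
- I → I and J → J: $\log \varepsilon$
- Highest scoring path corresponds to the most likely alignment!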
36How do we find the highest scoring path?
- Compute the following matrices (DP)
- $M(i, j)$: most likely alignment of $x_1 \ldots x_i$ with $y_1 \ldots y_j$ ending in state M
- $I(i, j)$: most likely alignment of $x_1 \ldots x_i$ with $y_1 \ldots y_j$ ending in state I
- $J(i, j)$: most likely alignment of $x_1 \ldots x_i$ with $y_1 \ldots y_j$ ending in state J
- $M(i, j) = \log \mathrm{Prob}(x_i, y_j) + \max\{\, M(i-1, j-1) + \log(1 - 2\delta),\; I(i-1, j-1) + \log(1 - \varepsilon),\; J(i-1, j-1) + \log(1 - \varepsilon) \,\}$
- $I(i, j) = \max\{\, M(i-1, j) + \log \delta,\; I(i-1, j) + \log \varepsilon \,\}$
37The Viterbi algorithm for alignment
- For each $i = 1, \ldots, M$
- For each $j = 1, \ldots, N$
- $M(i, j) = \log \mathrm{Prob}(x_i, y_j) + \max\{\, M(i-1, j-1) + \log(1 - 2\delta),\; I(i-1, j-1) + \log(1 - \varepsilon),\; J(i-1, j-1) + \log(1 - \varepsilon) \,\}$
- $I(i, j) = \max\{\, M(i-1, j) + \log \delta,\; I(i-1, j) + \log \varepsilon \,\}$
- $J(i, j) = \max\{\, M(i, j-1) + \log \delta,\; J(i, j-1) + \log \varepsilon \,\}$
- When matrices are filled, we can trace back from $(M, N)$ the likeliest alignment
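A compact Python sketch of this three-matrix Viterbi pass is given below. Several details are assumptions made for illustration: the path starts in state M before any letter is read, emission terms for I and J are left out (as in the recurrences), the final state is whichever of M, I, J scores best at (M, N), and `log_pair` is a caller-supplied function. M consumes a letter of each sequence, I a letter of x only, and J a letter of y only.

```python
import math

NEG_INF = float("-inf")

def pair_hmm_viterbi(x, y, log_pair, delta=0.1, epsilon=0.3):
    """Return (best log score, gapped x, gapped y)."""
    n_x, n_y = len(x), len(y)
    log_stay_m = math.log(1 - 2 * delta)
    log_close = math.log(1 - epsilon)
    log_open, log_extend = math.log(delta), math.log(epsilon)

    V = {s: [[NEG_INF] * (n_y + 1) for _ in range(n_x + 1)] for s in "MIJ"}
    back = {s: [[None] * (n_y + 1) for _ in range(n_x + 1)] for s in "MIJ"}
    V["M"][0][0] = 0.0  # assumed start: state M, nothing emitted yet

    def best(candidates):
        return max(candidates, key=lambda c: c[0])

    for i in range(n_x + 1):
        for j in range(n_y + 1):
            if i > 0 and j > 0:
                score, prev = best([
                    (V["M"][i - 1][j - 1] + log_stay_m, "M"),
                    (V["I"][i - 1][j - 1] + log_close, "I"),
                    (V["J"][i - 1][j - 1] + log_close, "J"),
                ])
                V["M"][i][j] = log_pair(x[i - 1], y[j - 1]) + score
                back["M"][i][j] = prev
            if i > 0:
                V["I"][i][j], back["I"][i][j] = best([
                    (V["M"][i - 1][j] + log_open, "M"),
                    (V["I"][i - 1][j] + log_extend, "I"),
                ])
            if j > 0:
                V["J"][i][j], back["J"][i][j] = best([
                    (V["M"][i][j - 1] + log_open, "M"),
                    (V["J"][i][j - 1] + log_extend, "J"),
                ])

    # Trace back from (M, N), starting in the best final state.
    state = max("MIJ", key=lambda s: V[s][n_x][n_y])
    score = V[state][n_x][n_y]
    row_x, row_y = [], []
    i, j = n_x, n_y
    while i > 0 or j > 0:
        prev = back[state][i][j]
        if state == "M":
            row_x.append(x[i - 1]); row_y.append(y[j - 1])
            i, j = i - 1, j - 1
        elif state == "I":
            row_x.append(x[i - 1]); row_y.append("-")
            i -= 1
        else:
            row_x.append("-"); row_y.append(y[j - 1])
            j -= 1
        state = prev
    return score, "".join(reversed(row_x)), "".join(reversed(row_y))

# Hypothetical emission model over 4 nucleotides: 0.2 per identical pair,
# 0.2 / 12 per mismatched pair (the 16 pair probabilities sum to 1).
def toy_log_pair(a, b):
    return math.log(0.2 if a == b else 0.2 / 12)

print(pair_hmm_viterbi("AGTA", "ATA", toy_log_pair))
```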
38One way to view the state paths: State M
[Figure: grid of M states, one per pair of positions, with axes x1 … xm and y1 … yn]
39State I
[Figure: grid of I states, one per pair of positions, with axes x1 … xm and y1 … yn]
40State J
[Figure: grid of J states, one per pair of positions, with axes x1 … xm and y1 … yn]
41Putting it all together
- States I(i, j) are connected with states I and M at (i-1, j)
- States J(i, j) are connected with states J and M at (i, j-1)
- States M(i, j) are connected with states M, I, and J at (i-1, j-1)
[Figure: the M, I, and J grids combined, with axes x1 … xm and y1 … yn]
42Putting it all together
- Optimal solution is the best scoring path from the top-left to the bottom-right corner
- This gives the likeliest alignment according to our HMM
43Yet another way to represent this model
[Figure: model with BEGIN and END states, match states Mx1 … Mxm laid out along Sequence X, and insert states Ix and Iy]
- We are aligning, or threading, sequence Y through sequence X
- Every time $y_j$ lands in state $x_i$, we get substitution score $s(x_i, y_j)$
- Every time $y_j$ is gapped, or some $x_i$ is skipped, we pay a gap penalty