Sequence Similarity - PowerPoint PPT Presentation

About This Presentation

Title:

Sequence Similarity

Slides: 44

Provided by: root

Learn more at: http://web.stanford.edu

Transcript and Presenter's Notes

Title: Sequence Similarity

1
Sequence Similarity
2
Why sequence similarity

structural similarity
>25% sequence identity ⇒ similar structure
evolutionary relationship
all proteins come from < 2000 (super)families
related functional role
similar structure ⇒ similar function
functional modules are often preserved

3
Muscle cells and contraction
4
Actin and myosin during muscle movement
5
Actin structure
6
Actin sequence

Actin is ancient and abundant
Most abundant protein in cells
1-2 actin genes in bacteria, yeasts, amoebas
Humans: 6 actin genes
α-actin in muscles; β-actin, γ-actin in non-muscle cells
4 amino acids differ between each version
MUSCLE ACTIN Amino Acid Sequence
1 EEEQTALVCD NGSGLVKAGF AGDDAPRAVF PSIVRPRHQG VMVGMGQKDS YVGDEAQSKR
61 GILTLKYPIE HGIITNWDDM EKIWHHTFYN ELRVAPEEHP VLLTEAPLNP KANREKMTQI
121 MFETFNVPAM YVAIQAVLSL YASGRTTGIV LDSGDGVSHN VPIYEGYALP HAIMRLDLAG
181 RDLTDYLMKI LTERGYSFVT TAEREIVRDI KEKLCYVALD FEQEMATAAS SSSLEKSYEL
241 PDGQVITIGN ERFRGPETMF QPSFIGMESS GVHETTYNSI MKCDIDIRKD LYANNVLSGG
301 TTMYPGIADR MQKEITALAP STMKIKIIAP PERKYSVWIG GSILASLSTF QQMWITKQEY
361 DESGPSIVHR KCF

7
A related protein in bacteria
8
Relation between sequence and structure
9
A multiple alignment of actins
10
Gene expression
DNA: CCTGAGCCAACTATTGATGAA
RNA: CCUGAGCCAACUAUUGAUGAA
Protein: PEPTIDE
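The three strings on this slide can be reproduced with a short sketch. It assumes the DNA shown is the coding strand read in frame from its first base, and it uses the standard genetic code; the names `CODON_TABLE`, `transcribe`, and `translate` are illustrative and not from the slides.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(dna: str) -> str:
    """Write coding-strand DNA as mRNA by replacing T with U."""
    return dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate mRNA codon by codon, starting at position 0 (assumed frame)."""
    dna = mrna.upper().replace("U", "T")
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


mrna = transcribe("CCTGAGCCAACTATTGATGAA")
print(mrna)             # CCUGAGCCAACUAUUGAUGAA
print(translate(mrna))  # PEPTIDE
```

Each one-letter amino acid code in the output corresponds to one codon, which is why the 21-base example spells a 7-letter word.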
11
Biomolecules as Strings

Macromolecules are the chemical building blocks
of cells
Proteins
20 amino acids
Nucleic acids
4 nucleotides: A, C, G, T
Polysaccharides

12
The information is in the sequence

Sequence → Structure → Function
Sequence similarity ⇒ Structural and/or Functional similarity
Nucleic acids and proteins are related by molecular evolution
Orthologs: two proteins in animals X and Y that evolved from one protein in immediate ancestor animal Z
Paralogs: two proteins that evolved from one protein through duplication in some ancestor
Homologs: orthologs or paralogs that exhibit sequence similarity

13
Protein Phylogenies

Proteins evolve by both duplication and species divergence

[Figure: protein phylogeny tree with labels "duplication", "orthologs", "paralogs"]
14
Evolution
15
Evolution at the DNA level
SEQUENCE EDITS: Mutation, Deletion
ACGGTGCAGTTACCA
AC----CAGTCACCA
REARRANGEMENTS: Inversion, Translocation, Duplication
16
Evolutionary Rates

[Figure: sequences passed to the next generation, with changes marked "OK", "X", or "Still OK?"]
Changes in non-functional sites are OK, so will be propagated
Most changes in functional sites are deleterious and will be rejected
17
Sequence conservation implies function
Proteins between humans and rodents are on average 85% identical
18
Sequence Alignment
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Definition: Given two strings $x = x_1 x_2 \ldots x_M$, $y = y_1 y_2 \ldots y_N$, an alignment is an assignment of gaps to positions $0, \ldots, M$ in $x$, and $0, \ldots, N$ in $y$, so as to line up each letter in one sequence with either a letter, or a gap in the other sequence
19
What is a good alignment?

Alignment
The best way to match the letters of one
sequence with those of the other
How do we define best?
Alignment
A hypothesis that the two sequences come from a
common ancestor through sequence edits
Parsimonious explanation
Find the minimum number of edits that transform
one sequence into the other

20
Scoring Function

Sequence edits: AGGCCTC
Mutations: AGGACTC
Insertions: AGGGCCTC
Deletions: AGGCTC
Scoring Function
Match: m
Mismatch: s
Gap: d
Score $F = (\#\,\text{matches}) \times m - (\#\,\text{mismatches}) \times s - (\#\,\text{gaps}) \times d$
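As a minimal sketch of this scoring function, the snippet below scores an alignment that is already given as two gapped rows. It assumes m is a reward and s, d are positive penalties, and that "-" marks a gap; the function name `score_alignment` and the default values are illustrative.

```python
def score_alignment(top: str, bottom: str, m: float = 1, s: float = 1, d: float = 1) -> float:
    """Score a gapped alignment: +m per match, -s per mismatch, -d per gap."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    matches = mismatches = gaps = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            gaps += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches * m - mismatches * s - gaps * d


print(score_alignment("AGGCCTC", "AGGACTC"))  # 6 matches, 1 mismatch -> 5
print(score_alignment("AGGCCTC", "AGG-CTC"))  # 6 matches, 1 gap -> 5
```

With unit penalties a mutation and a deletion cost the same; raising d relative to s makes the scoring prefer mismatches over gaps.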

21
How do we compute the best alignment?
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
Too many possible alignments: $O(2^{M+N})$
AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC
22
Alignment is additive

Observation
The score of aligning $x_1 \ldots x_M$ with $y_1 \ldots y_N$ is additive
Say that $x_1 \ldots x_i \;\; x_{i+1} \ldots x_M$
aligns to $y_1 \ldots y_j \;\; y_{j+1} \ldots y_N$
The two scores add up:
$F(x_{1..M}, y_{1..N}) = F(x_{1..i}, y_{1..j}) + F(x_{i+1..M}, y_{j+1..N})$
Key property: optimal solution to the entire problem is composed of optimal solutions to subproblems ⇒ Dynamic Programming

23
Dynamic Programming

Construct a DP matrix F of size $(M+1) \times (N+1)$, indexed from 0
Suppose we wish to align
$x_1 \ldots x_M$
$y_1 \ldots y_N$
Let
$F(i, j)$ = optimal score of aligning
$x_1 \ldots x_i$
$y_1 \ldots y_j$

24
Dynamic Programming (contd)

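Notice three possible cases:
1. $x_i$ aligns to $y_j$
$x_1 \ldots x_{i-1} \;\; x_i$
$y_1 \ldots y_{j-1} \;\; y_j$
$F(i, j) = F(i-1, j-1) + m$ if $x_i = y_j$, and $F(i, j) = F(i-1, j-1) - s$ if not
2. $x_i$ aligns to a gap
$x_1 \ldots x_{i-1} \;\; x_i$
$y_1 \ldots y_j \;\; -$
$F(i, j) = F(i-1, j) - d$
3. $y_j$ aligns to a gap
$x_1 \ldots x_i \;\; -$
$y_1 \ldots y_{j-1} \;\; y_j$
$F(i, j) = F(i, j-1) - d$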
25
Dynamic Programming (contd)

How do we know which case is correct?
Inductive assumption:
$F(i, j-1)$, $F(i-1, j)$, $F(i-1, j-1)$ are optimal
Then,
$$F(i, j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j) \\ F(i-1, j) - d \\ F(i, j-1) - d \end{cases}$$
where $s(x_i, y_j) = m$ if $x_i = y_j$, and $-s$ if not

26
Example

x = AGTA    m = 1 (match)
y = ATA     s = 1 (mismatch penalty)
            d = 1 (gap penalty)

F(i, j)   i = 0    1    2    3    4
                   A    G    T    A
j = 0         0   -1   -2   -3   -4
j = 1   A    -1    1    0   -1   -2
j = 2   T    -2    0    0    1    0
j = 3   A    -3   -1   -1    0    2

Optimal alignment: F(4, 3) = 2
AGTA
A-TA
27
The Needleman-Wunsch Algorithm

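Initialization.
$F(0, 0) = 0$
$F(0, j) = -j \times d$
$F(i, 0) = -i \times d$
Main Iteration. Filling-in partial alignments
For each $i = 1, \ldots, M$
For each $j = 1, \ldots, N$
$$F(i, j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j) & \text{[case 1]} \\ F(i-1, j) - d & \text{[case 2]} \\ F(i, j-1) - d & \text{[case 3]} \end{cases}$$
$$\mathrm{Ptr}(i, j) = \begin{cases} \text{DIAG} & \text{if case 1} \\ \text{LEFT} & \text{if case 2} \\ \text{UP} & \text{if case 3} \end{cases}$$
Termination. $F(M, N)$ is the optimal score, and from $\mathrm{Ptr}(M, N)$ we can trace back the optimal alignment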
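The three steps above (initialization, main iteration, termination with traceback) can be sketched in Python as follows. This is a minimal illustration rather than the course's reference implementation: it assumes m is a match reward and s, d are positive penalties, and ties between cases are broken in favor of the earlier case, which is an arbitrary choice.

```python
def needleman_wunsch(x: str, y: str, m: int = 1, s: int = 1, d: int = 1):
    """Global alignment; returns (score, aligned x row, aligned y row)."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    ptr = [[None] * (N + 1) for _ in range(M + 1)]
    # Initialization: a prefix aligned entirely against gaps.
    for i in range(1, M + 1):
        F[i][0], ptr[i][0] = -i * d, "LEFT"
    for j in range(1, N + 1):
        F[0][j], ptr[0][j] = -j * d, "UP"
    # Main iteration: fill in partial alignments.
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            cases = [
                (F[i - 1][j - 1] + sub, "DIAG"),  # case 1: x_i aligned to y_j
                (F[i - 1][j] - d, "LEFT"),        # case 2: x_i aligned to a gap
                (F[i][j - 1] - d, "UP"),          # case 3: y_j aligned to a gap
            ]
            F[i][j], ptr[i][j] = max(cases, key=lambda c: c[0])
    # Termination: trace the pointers back from (M, N) to (0, 0).
    top, bottom = [], []
    i, j = M, N
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "DIAG":
            top.append(x[i - 1]); bottom.append(y[j - 1]); i -= 1; j -= 1
        elif move == "LEFT":
            top.append(x[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1]); j -= 1
    return F[M][N], "".join(reversed(top)), "".join(reversed(bottom))


print(needleman_wunsch("AGTA", "ATA"))  # (2, 'AGTA', 'A-TA')
```

Python strings are 0-indexed, so `x[i - 1]` plays the role of $x_i$ in the recurrence.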

28
Performance

Time
O(NM)
Space
O(NM)
Possible to reduce space to O(N+M) using Hirschberg's divide & conquer algorithm
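The space saving rests on the observation that each row of F depends only on the row before it. The sketch below is not Hirschberg's algorithm itself; it only shows the two-row score computation that such a divide-and-conquer scheme builds on, and it returns the optimal score without an alignment. The function name and default penalties are illustrative assumptions.

```python
def nw_score_linear_space(x: str, y: str, m: int = 1, s: int = 1, d: int = 1) -> int:
    """Optimal global alignment score, keeping only two DP rows in memory."""
    if len(y) > len(x):
        x, y = y, x  # the score is symmetric, so keep the shorter string as the row
    prev = [-j * d for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        curr = [-i * d] + [0] * len(y)
        for j in range(1, len(y) + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            curr[j] = max(prev[j - 1] + sub, prev[j] - d, curr[j - 1] - d)
        prev = curr
    return prev[-1]


print(nw_score_linear_space("AGTA", "ATA"))  # 2
```

Time is still O(NM); only the memory drops, to a row of length min(M, N) + 1.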

29
Substitutions of Amino Acids

Mutation rates between amino acids have dramatic
differences!

How can we quantify the differences in rates by
which one amino acid replaces another across
related proteins?
30
Substitution Matrices

BLOSUM matrices
Start from BLOCKS database (curated, gap-free
alignments)
Cluster sequences according to > X% identity
Calculate $A_{ab}$: the number of aligned a-b pairs in distinct clusters, correcting by $1/mn$, where m, n are the two cluster sizes
Estimate
$P(a) = \left(\sum_b A_{ab}\right) / \left(\sum_{cd} A_{cd}\right)$; $P(a, b) = A_{ab} / \left(\sum_{cd} A_{cd}\right)$
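The two estimates can be illustrated on toy counts. In this sketch the counts are made up for illustration (they are not taken from BLOCKS), the table is stored symmetrically so that $A_{ab} = A_{ba}$, and the final log-odds step is a common way to turn such frequencies into scores; it is not spelled out on the slide.

```python
import math
from collections import defaultdict


def estimate_frequencies(pair_counts):
    """pair_counts maps unordered pairs (a, b) to weighted counts A_ab."""
    A = defaultdict(float)
    for (a, b), count in pair_counts.items():
        A[a, b] += count
        if a != b:
            A[b, a] += count  # store symmetrically: A[a, b] == A[b, a]
    total = sum(A.values())  # sum over all c, d of A_cd
    P_pair = {ab: c / total for ab, c in A.items()}
    P = defaultdict(float)
    for (a, _b), p in P_pair.items():
        P[a] += p  # P(a) = sum over b of A_ab / total
    return dict(P), P_pair


def log_odds(P, P_pair, a, b):
    """log2 of observed pair frequency over the frequency expected by chance."""
    return math.log2(P_pair[a, b] / (P[a] * P[b]))


# Hypothetical toy counts for a two-letter alphabet.
P, P_pair = estimate_frequencies({("A", "A"): 6, ("A", "S"): 2, ("S", "S"): 2})
print(P)
print(log_odds(P, P_pair, "A", "A"), log_odds(P, P_pair, "A", "S"))
```

Pairs that co-occur more often than chance get a positive score and rarer pairs a negative one, which is the property a substitution matrix needs.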

31
Gaps are not inserted uniformly
32
A state model for alignment
Alignments correspond 1-to-1 with sequences of states M, I, J
M (+1, +1)
I (+1, 0)
J (0, +1)
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC
IMMJMMMMMMMJJMMMMMMJMMMMMMMIIMMMMMIII
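The 1-to-1 correspondence can be checked mechanically by reading the state off each alignment column. The slide does not say which row is x and which is y, so the mapping used here (a gap in the top row gives I, a gap in the bottom row gives J) is an assumption, chosen because it reproduces the state string shown with the alignment.

```python
def state_path(top: str, bottom: str) -> str:
    """Read one state (M, I, or J) off each column of a two-row alignment."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    states = []
    for a, b in zip(top, bottom):
        if a == "-" and b == "-":
            raise ValueError("a column cannot consist of two gaps")
        if a == "-":
            states.append("I")  # assumed convention: gap in the top row
        elif b == "-":
            states.append("J")  # assumed convention: gap in the bottom row
        else:
            states.append("M")  # two letters, whether or not they match
    return "".join(states)


top = "-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---"
bottom = "TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC"
print(state_path(top, bottom))
```

Going the other way, a state string plus the two ungapped sequences determines the alignment, which is what makes the correspondence 1-to-1.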
33
Let's score the transitions
M (+1, +1): every transition into M scores $s(x_i, y_j)$
I (+1, 0): M → I scores $-d$; I → I scores $-e$
J (0, +1): M → J scores $-d$; J → J scores $-e$
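Scoring transitions this way gives an affine gap penalty: the first gapped column of a run costs d and each further column of the same run costs e. A minimal sketch, assuming m is a match reward, s a positive mismatch penalty, and that a direct switch between the two gap states is charged as a new opening (the slide's figure does not show such a transition); all names and default values are illustrative.

```python
def affine_alignment_score(top: str, bottom: str, m: int = 1, s: int = 1,
                           d: int = 2, e: int = 1) -> int:
    """Score an alignment where opening a gap costs d and extending it costs e."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    score = 0
    prev = "M"  # treat the start as if coming from a match state
    for a, b in zip(top, bottom):
        state = "I" if a == "-" else "J" if b == "-" else "M"
        if state == "M":
            score += m if a == b else -s  # substitution score s(x_i, y_j)
        elif state == prev:
            score -= e  # staying in the same gap state: extension
        else:
            score -= d  # entering a gap state: opening
        prev = state
    return score


print(affine_alignment_score("AGGCT", "A--CT"))  # 3 matches, 1 open, 1 extend -> 0
print(affine_alignment_score("AGGCT", "A-G-T"))  # 3 matches, 2 opens -> -1
```

With e < d, one gap of length two scores better than two separate gaps of length one, which matches the observation that gaps are not inserted uniformly.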
34
A probabilistic model for alignment

Assign probabilities to every transition (arrow),
and emission (pair of letters or gaps)
Probabilities of mutation reflect amino acid
similarities
Different probabilities for opening and
extending gap

35
A Pair HMM for alignments
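Emissions: M emits the pair $P(x_i, y_j)$, scored $\log \mathrm{Prob}(x_i, y_j)$; I emits $P(x_i)$; J emits $P(y_j)$
Transitions:
M → M: $\log(1 - 2\delta)$
M → I, M → J: $\log \delta$
I → I, J → J: $\log \varepsilon$
I → M, J → M: $\log(1 - \varepsilon)$
Highest scoring path corresponds to the most likely alignment!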
36
How do we find the highest scoring path?

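Compute the following matrices (DP)
$M(i, j)$: log-probability of the most likely alignment of $x_1 \ldots x_i$ with $y_1 \ldots y_j$ ending in state M
$I(i, j)$: log-probability of the most likely alignment of $x_1 \ldots x_i$ with $y_1 \ldots y_j$ ending in state I
$J(i, j)$: log-probability of the most likely alignment of $x_1 \ldots x_i$ with $y_1 \ldots y_j$ ending in state J
$M(i, j) = \log \mathrm{Prob}(x_i, y_j) + \max \{ M(i-1, j-1) + \log(1 - 2\delta),\; I(i-1, j-1) + \log(1 - \varepsilon),\; J(i-1, j-1) + \log(1 - \varepsilon) \}$
$I(i, j) = \max \{ M(i-1, j) + \log \delta,\; I(i-1, j) + \log \varepsilon \}$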

37
The Viterbi algorithm for alignment

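For each $i = 1, \ldots, M$
For each $j = 1, \ldots, N$
$M(i, j) = \log \mathrm{Prob}(x_i, y_j) + \max \{ M(i-1, j-1) + \log(1 - 2\delta),\; I(i-1, j-1) + \log(1 - \varepsilon),\; J(i-1, j-1) + \log(1 - \varepsilon) \}$
$I(i, j) = \max \{ M(i-1, j) + \log \delta,\; I(i-1, j) + \log \varepsilon \}$
$J(i, j) = \max \{ M(i, j-1) + \log \delta,\; J(i, j-1) + \log \varepsilon \}$
When matrices are filled, we can trace back from (M, N) the likeliest alignment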
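A minimal sketch of this Viterbi recurrence in log space follows. Several choices here are assumptions rather than anything stated on the slides: state M looks back to cell (i-1, j-1), I to (i-1, j), and J to (i, j-1); the gap states add an explicit emission term; the emission model (`p_match`, uniform single-letter frequencies over a 4-letter DNA alphabet) and the values of `delta` and `eps` are made up for illustration; and no end-state transition is modeled.

```python
import math

NEG_INF = float("-inf")


def viterbi_align(x, y, delta=0.1, eps=0.3, p_match=0.9):
    """Most likely alignment of DNA strings x, y under a 3-state pair HMM."""
    def log_pair(a, b):  # hypothetical emission model for state M
        return math.log(p_match / 4 if a == b else (1 - p_match) / 12)
    log_single = math.log(1 / 4)  # hypothetical uniform emission for I and J
    n, k = len(x), len(y)
    # V[s][i][j]: best log-probability of aligning x[:i], y[:j] ending in state s.
    V = {s: [[NEG_INF] * (k + 1) for _ in range(n + 1)] for s in "MIJ"}
    back = {s: [[None] * (k + 1) for _ in range(n + 1)] for s in "MIJ"}
    V["M"][0][0] = 0.0  # start as if coming from a match state
    for i in range(n + 1):
        for j in range(k + 1):
            if i > 0 and j > 0:
                best, back["M"][i][j] = max(
                    (V["M"][i - 1][j - 1] + math.log(1 - 2 * delta), "M"),
                    (V["I"][i - 1][j - 1] + math.log(1 - eps), "I"),
                    (V["J"][i - 1][j - 1] + math.log(1 - eps), "J"),
                )
                V["M"][i][j] = log_pair(x[i - 1], y[j - 1]) + best
            if i > 0:  # state I consumes x_i against a gap
                best, back["I"][i][j] = max(
                    (V["M"][i - 1][j] + math.log(delta), "M"),
                    (V["I"][i - 1][j] + math.log(eps), "I"),
                )
                V["I"][i][j] = log_single + best
            if j > 0:  # state J consumes y_j against a gap
                best, back["J"][i][j] = max(
                    (V["M"][i][j - 1] + math.log(delta), "M"),
                    (V["J"][i][j - 1] + math.log(eps), "J"),
                )
                V["J"][i][j] = log_single + best
    # Trace back from (n, k), starting in the best final state.
    state = max("MIJ", key=lambda s: V[s][n][k])
    score = V[state][n][k]
    i, j, row_x, row_y, path = n, k, [], [], []
    while i > 0 or j > 0:
        path.append(state)
        prev = back[state][i][j]
        if state == "M":
            row_x.append(x[i - 1]); row_y.append(y[j - 1]); i -= 1; j -= 1
        elif state == "I":
            row_x.append(x[i - 1]); row_y.append("-"); i -= 1
        else:
            row_x.append("-"); row_y.append(y[j - 1]); j -= 1
        state = prev
    rev = lambda parts: "".join(reversed(parts))
    return score, rev(row_x), rev(row_y), rev(path)


score, row_x, row_y, path = viterbi_align("AGTA", "ATA")
print(row_x, row_y, path)
```

On the small example used earlier (AGTA against ATA) this recovers the same alignment as the score-based dynamic program, with the unmatched G emitted from state I.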

38
One way to view the state paths: State M

[Figure: grid with axes labeled x1 ... xm and y1 ... yn]
39
State I

[Figure: grid with axes labeled x1 ... xm and y1 ... yn]
40
State J

[Figure: grid with axes labeled x1 ... xm and y1 ... yn]
41
Putting it all together
States I(i, j) are connected with states J and M (i-1, j)
States J(i, j) are connected with states I and M (i, j-1)
States M(i, j) are connected with states J and I (i-1, j-1)

[Figure: grid with axes labeled x1 ... xm and y1 ... yn]
42
Putting it all together
States I(i, j) are connected with states J and M (i-1, j)
States J(i, j) are connected with states I and M (i, j-1)
States M(i, j) are connected with states J and I (i-1, j-1)
Optimal solution is the best scoring path from top-left to bottom-right corner
This gives the likeliest alignment according to our HMM

[Figure: grid with axes labeled x1 ... xm and y1 ... yn]
43
Yet another way to represent this model
[Figure: states BEGIN, Mx1 ... Mxm, END laid out along Sequence X, with insert states Ix and Iy]
We are aligning, or threading, sequence Y through sequence X
Every time $y_j$ lands in state $x_i$, we get substitution score $s(x_i, y_j)$
Every time $y_j$ is gapped, or some $x_i$ is skipped, we pay gap penalty
