Sequence Similarity

Slides: 44
Provided by: root
Learn more at: http://web.stanford.edu

Transcript and Presenter's Notes


1
Sequence Similarity
2
Why sequence similarity
  • structural similarity
  • >25% sequence identity ⇒ similar structure
  • evolutionary relationship
  • all proteins come from < 2000 (super)families
  • related functional role
  • similar structure ⇒ similar function
  • functional modules are often preserved

3
Muscle cells and contraction
4
Actin and myosin during muscle movement
5
Actin structure
6
Actin sequence
  • Actin is ancient and abundant
  • Most abundant protein in cells
  • 1-2 actin genes in bacteria, yeasts, amoebas
  • Humans: 6 actin genes
  • α-actin in muscles; β-actin, γ-actin in
    non-muscle cells
  • 4 amino acids differ between each version
  • MUSCLE ACTIN Amino Acid Sequence
  • 1 EEEQTALVCD NGSGLVKAGF AGDDAPRAVF PSIVRPRHQG
    VMVGMGQKDS YVGDEAQSKR
  • 61 GILTLKYPIE HGIITNWDDM EKIWHHTFYN ELRVAPEEHP
    VLLTEAPLNP KANREKMTQI
  • 121 MFETFNVPAM YVAIQAVLSL YASGRTTGIV LDSGDGVSHN
    VPIYEGYALP HAIMRLDLAG
  • 181 RDLTDYLMKI LTERGYSFVT TAEREIVRDI KEKLCYVALD
    FEQEMATAAS SSSLEKSYEL
  • 241 PDGQVITIGN ERFRGPETMF QPSFIGMESS GVHETTYNSI
    MKCDIDIRKD LYANNVLSGG
  • 301 TTMYPGIADR MQKEITALAP STMKIKIIAP PERKYSVWIG
    GSILASLSTF QQMWITKQEY
  • 361 DESGPSIVHR KCF

7
A related protein in bacteria
8
Relation between sequence and structure
9
A multiple alignment of actins
10
Gene expression
DNA:     CCTGAGCCAACTATTGATGAA
RNA:     CCUGAGCCAACUAUUGAUGAA
Protein: PEPTIDE
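
The three lines of this slide can be reproduced with a short sketch. It assumes the DNA string is the coding strand read from its first base, and it uses the standard genetic code; the function names `transcribe` and `translate` are illustrative and do not come from the slides.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon, codons enumerated in TCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna: str) -> str:
    """Coding-strand DNA -> mRNA: every T becomes U."""
    return dna.replace("T", "U")


def translate(rna: str) -> str:
    """mRNA -> peptide, reading codons from position 0 until a stop codon."""
    peptide = []
    for k in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[k:k + 3].replace("U", "T")]
        if amino_acid == "*":  # stop codon
            break
        peptide.append(amino_acid)
    return "".join(peptide)


dna = "CCTGAGCCAACTATTGATGAA"
rna = transcribe(dna)
print(rna)             # CCUGAGCCAACUAUUGAUGAA
print(translate(rna))  # PEPTIDE
```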
11
Biomolecules as Strings
  • Macromolecules are the chemical building blocks
    of cells
  • Proteins
  • 20 amino acids
  • Nucleic acids
  • 4 nucleotides: A, C, G, T
  • Polysaccharides

12
The information is in the sequence
  • Sequence → Structure → Function
  • Sequence similarity ⇒ structural and/or
    functional similarity
  • Nucleic acids and proteins are related by
    molecular evolution
  • Orthologs: two proteins in animals X and Y that
    evolved from one protein in their immediate
    ancestor animal Z
  • Paralogs: two proteins that evolved from one
    protein through duplication in some ancestor
  • Homologs: orthologs or paralogs that exhibit
    sequence similarity

13
Protein Phylogenies
  • Proteins evolve by both duplication and species
    divergence

(Figure: protein phylogeny with labels duplication, orthologs, paralogs)
14
Evolution
15
Evolution at the DNA level
Sequence edits: mutation, deletion
ACGGTGCAGTTACCA
AC----CAGTCACCA
Rearrangements: inversion, translocation, duplication
16
Evolutionary Rates



Changes in non-functional sites are OK, so they will
be propagated to the next generation.
Most changes in functional sites are deleterious
and will be rejected.
17
Sequence conservation implies function
Human and rodent proteins are on average 85%
identical
18
Sequence Alignment
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Definition: Given two strings $x = x_1 x_2 \ldots x_M$ and
$y = y_1 y_2 \ldots y_N$, an alignment is an assignment of
gaps to positions $0, \ldots, M$ in $x$ and $0, \ldots, N$ in $y$,
so as to line up each letter in one sequence
with either a letter or a gap in the other
sequence
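
As a minimal sketch of this definition, the check below tests whether two gapped rows form an alignment of two given strings: the rows have equal length, each row spells its sequence once gaps are removed, and no column pairs a gap with a gap. The names `is_valid_alignment` and `percent_identity` are illustrative, and measuring identity over all alignment columns is one convention among several.

```python
def is_valid_alignment(row_a, row_b, seq_a, seq_b, gap="-"):
    """True if (row_a, row_b) lines up seq_a with seq_b column by column."""
    if len(row_a) != len(row_b):
        return False
    if row_a.replace(gap, "") != seq_a or row_b.replace(gap, "") != seq_b:
        return False
    # A column may hold at most one gap.
    return all(not (a == gap and b == gap) for a, b in zip(row_a, row_b))


def percent_identity(row_a, row_b):
    """Identical columns as a percentage of all alignment columns (assumed convention)."""
    identical = sum(1 for a, b in zip(row_a, row_b) if a == b)
    return 100.0 * identical / len(row_a)


seq_a = "AGGCTATCACCTGACCTCCAGGCCGATGCCC"
seq_b = "TAGCTATCACGACCGCGGTCGATTTGCCCGAC"
row_a = "-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---"
row_b = "TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC"

print(is_valid_alignment(row_a, row_b, seq_a, seq_b))  # True
print(percent_identity("AGTA", "A-TA"))                # 75.0
```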
19
What is a good alignment?
  • Alignment
  • The best way to match the letters of one
    sequence with those of the other
  • How do we define best?
  • Alignment
  • A hypothesis that the two sequences come from a
    common ancestor through sequence edits
  • Parsimonious explanation
  • Find the minimum number of edits that transform
    one sequence into the other

20
Scoring Function
  • Sequence edits of AGGCCTC
  • Mutation: AGGACTC
  • Insertion: AGGGCCTC
  • Deletion: AGGCTC
  • Scoring function
  • Match: $m$
  • Mismatch: $s$
  • Gap: $d$
  • Score $F = (\#\text{matches}) \times m - (\#\text{mismatches}) \times s
    - (\#\text{gaps}) \times d$
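
The scoring function can be sketched directly: count matches, mismatches, and gap columns in a given alignment and combine them. The sketch treats $s$ and $d$ as positive penalties that are subtracted, and the default values $m = s = d = 1$ are illustrative assumptions.

```python
def alignment_score(row_x, row_y, m=1, s=1, d=1, gap="-"):
    """Score = (#matches)*m - (#mismatches)*s - (#gaps)*d for one alignment."""
    matches = mismatches = gaps = 0
    for a, b in zip(row_x, row_y):
        if a == gap or b == gap:
            gaps += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches * m - mismatches * s - gaps * d


# The three edits of AGGCCTC listed above, written as alignments.
print(alignment_score("AGGCCTC", "AGGACTC"))    # mutation:  6 matches, 1 mismatch -> 5
print(alignment_score("AGG-CCTC", "AGGGCCTC"))  # insertion: 7 matches, 1 gap      -> 6
print(alignment_score("AGGCCTC", "AGGC-TC"))    # deletion:  6 matches, 1 gap      -> 5
```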

21
How do we compute the best alignment?
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC
Too many possible alignments: $O(2^{M+N})$
22
Alignment is additive
  • Observation: the score of aligning $x_1 \ldots x_M$
    with $y_1 \ldots y_N$ is additive
  • Say that $x_1 \ldots x_i \;\; x_{i+1} \ldots x_M$
    aligns to $y_1 \ldots y_j \;\; y_{j+1} \ldots y_N$
  • The two scores add up:
    $F(x[1..M], y[1..N]) = F(x[1..i], y[1..j]) + F(x[i+1..M], y[j+1..N])$
  • Key property: the optimal solution to the entire
    problem is composed of optimal solutions to
    subproblems ⇒ Dynamic Programming

23
Dynamic Programming
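  • Construct a DP matrix $F$ of size $(M+1) \times (N+1)$,
    indexed by $i = 0, \ldots, M$ and $j = 0, \ldots, N$
  • Suppose we wish to align $x_1 \ldots x_M$ with
    $y_1 \ldots y_N$
  • Let $F(i, j)$ = optimal score of aligning
    $x_1 \ldots x_i$ with $y_1 \ldots y_j$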

24
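Dynamic Programming (cont'd)
  • Notice three possible cases
  • 1. $x_i$ aligns to $y_j$
    $x_1 \ldots x_{i-1} \;\; x_i$
    $y_1 \ldots y_{j-1} \;\; y_j$
    $F(i, j) = F(i-1, j-1) + m$ if $x_i = y_j$,
    $F(i, j) = F(i-1, j-1) - s$ if not
  • 2. $x_i$ aligns to a gap
    $x_1 \ldots x_{i-1} \;\; x_i$
    $y_1 \ldots y_j \;\; -$
    $F(i, j) = F(i-1, j) - d$
  • 3. $y_j$ aligns to a gap
    $x_1 \ldots x_i \;\; -$
    $y_1 \ldots y_{j-1} \;\; y_j$
    $F(i, j) = F(i, j-1) - d$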
25
Dynamic Programming (cont'd)
  • How do we know which case is correct?
  • Inductive assumption:
    $F(i, j-1)$, $F(i-1, j)$, $F(i-1, j-1)$ are
    optimal
  • Then,
    $$F(i, j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j) \\ F(i-1, j) - d \\ F(i, j-1) - d \end{cases}$$
  • where $s(x_i, y_j) = m$ if $x_i = y_j$, and $-s$ if not
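
The recurrence can be written almost literally as a memoized recursive function, which makes the inductive assumption explicit: each call trusts the three smaller subproblems to be optimal. This is a sketch rather than the slides' implementation; the base cases $F(0, j) = -j \times d$ and $F(i, 0) = -i \times d$ are the usual global-alignment boundary conditions, and the defaults $m = s = d = 1$ are assumptions.

```python
from functools import lru_cache


def optimal_score(x, y, m=1, s=1, d=1):
    """Top-down evaluation of F(M, N); suitable only for short sequences
    because of Python's recursion depth limit."""

    @lru_cache(maxsize=None)
    def F(i, j):
        if i == 0:
            return -j * d  # y_1..y_j aligned entirely to gaps
        if j == 0:
            return -i * d  # x_1..x_i aligned entirely to gaps
        sub = m if x[i - 1] == y[j - 1] else -s  # s(x_i, y_j)
        return max(
            F(i - 1, j - 1) + sub,  # x_i aligns to y_j
            F(i - 1, j) - d,        # x_i aligns to a gap
            F(i, j - 1) - d,        # y_j aligns to a gap
        )

    return F(len(x), len(y))


print(optimal_score("AGTA", "ATA"))  # 2
```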

26
Example
  • x = AGTA, y = ATA
  • $m = 1$, $s = 1$, $d = 1$ (a mismatch scores $-s = -1$,
    a gap scores $-d = -1$)

F(i, j)  i:     0    1    2    3    4
                     A    G    T    A
j = 0           0   -1   -2   -3   -4
j = 1  A       -1    1    0   -1   -2
j = 2  T       -2    0    0    1    0
j = 3  A       -3   -1   -1    0    2

Optimal alignment: F(4, 3) = 2
AGTA
A-TA
27
The Needleman-Wunsch Algorithm
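  • Initialization
    $F(0, 0) = 0$
    $F(0, j) = -j \times d$
    $F(i, 0) = -i \times d$
  • Main iteration: filling in partial alignments
    For each $i = 1, \ldots, M$
      For each $j = 1, \ldots, N$
        $$F(i, j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j) & \text{[case 1]} \\ F(i-1, j) - d & \text{[case 2]} \\ F(i, j-1) - d & \text{[case 3]} \end{cases}$$
        $$\mathrm{Ptr}(i, j) = \begin{cases} \text{DIAG} & \text{if [case 1]} \\ \text{LEFT} & \text{if [case 2]} \\ \text{UP} & \text{if [case 3]} \end{cases}$$
  • Termination: $F(M, N)$ is the optimal score, and
    from $\mathrm{Ptr}(M, N)$ we can trace back the optimal alignment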
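
A minimal Python sketch of the three steps above (initialization, main iteration with pointers, traceback). It assumes a simple match/mismatch substitution score with $s$ and $d$ as positive penalties; ties are broken in favour of case 1, then case 2, which is an arbitrary choice.

```python
def needleman_wunsch(x, y, m=1, s=1, d=1):
    """Return (optimal score, aligned x, aligned y) for a global alignment."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    ptr = [[None] * (N + 1) for _ in range(M + 1)]

    # Initialization: a prefix aligned entirely against gaps.
    for i in range(1, M + 1):
        F[i][0], ptr[i][0] = -i * d, "LEFT"
    for j in range(1, N + 1):
        F[0][j], ptr[0][j] = -j * d, "UP"

    # Main iteration.
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            candidates = [
                (F[i - 1][j - 1] + sub, "DIAG"),  # case 1: x_i aligns to y_j
                (F[i - 1][j] - d, "LEFT"),        # case 2: x_i aligns to a gap
                (F[i][j - 1] - d, "UP"),          # case 3: y_j aligns to a gap
            ]
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])

    # Termination: follow the pointers back from (M, N) to (0, 0).
    row_x, row_y = [], []
    i, j = M, N
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "DIAG":
            row_x.append(x[i - 1]); row_y.append(y[j - 1]); i -= 1; j -= 1
        elif move == "LEFT":
            row_x.append(x[i - 1]); row_y.append("-"); i -= 1
        else:  # "UP"
            row_x.append("-"); row_y.append(y[j - 1]); j -= 1
    return F[M][N], "".join(reversed(row_x)), "".join(reversed(row_y))


print(needleman_wunsch("AGTA", "ATA"))  # (2, 'AGTA', 'A-TA')
```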

28
Performance
  • Time: $O(NM)$
  • Space: $O(NM)$
  • Possible to reduce space to $O(N + M)$ using
    Hirschberg's divide & conquer algorithm
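
The space reduction rests on the observation that each row of $F$ depends only on the previous row. The sketch below computes only the optimal score while storing two rows; it is not Hirschberg's full algorithm, which additionally recovers the alignment itself by divide and conquer. The defaults $m = s = d = 1$ are assumptions.

```python
def nw_score_linear_space(x, y, m=1, s=1, d=1):
    """Optimal global alignment score, keeping only two rows of F."""
    if len(y) > len(x):
        x, y = y, x  # the score is symmetric; store rows along the shorter sequence
    prev = [-j * d for j in range(len(y) + 1)]  # row i = 0
    for i in range(1, len(x) + 1):
        curr = [-i * d] + [0] * len(y)
        for j in range(1, len(y) + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            curr[j] = max(prev[j - 1] + sub, prev[j] - d, curr[j - 1] - d)
        prev = curr
    return prev[len(y)]


print(nw_score_linear_space("AGTA", "ATA"))  # 2
```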

29
Substitutions of Amino Acids
  • Mutation rates between amino acids have dramatic
    differences!

How can we quantify the differences in rates by
which one amino acid replaces another across
related proteins?
30
Substitution Matrices
  • BLOSUM matrices
  • Start from BLOCKS database (curated, gap-free
    alignments)
  • Cluster sequences according to > X% identity
  • Calculate $A_{ab}$: the number of aligned a-b pairs in
    distinct clusters, correcting by $1/mn$, where $m$, $n$
    are the two cluster sizes
  • Estimate
    $P(a) = \left(\sum_b A_{ab}\right) / \left(\sum_{c,d} A_{cd}\right)$,
    $P(a, b) = A_{ab} / \left(\sum_{c,d} A_{cd}\right)$
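
The two estimates can be sketched on a toy count table. The counts below are hypothetical, the table is assumed symmetric, and the sums run over all pairs $(c, d)$. The final log-odds step, $\log_2 \big( P(a, b) / (P(a) P(b)) \big)$, is the usual way such frequencies are turned into a substitution score; it is an assumption here, since the transcript stops at the estimates.

```python
import math


def estimate_frequencies(A):
    """A[a][b]: weighted count of aligned a-b pairs (assumed symmetric)."""
    total = sum(sum(row.values()) for row in A.values())
    P_single = {a: sum(A[a].values()) / total for a in A}
    P_pair = {(a, b): A[a][b] / total for a in A for b in A[a]}
    return P_single, P_pair


def log_odds(P_single, P_pair, a, b):
    """Assumed scoring step: observed pair frequency vs. chance expectation."""
    return math.log2(P_pair[(a, b)] / (P_single[a] * P_single[b]))


# Hypothetical counts for a two-letter alphabet.
A = {
    "L": {"L": 6.0, "I": 2.0},
    "I": {"L": 2.0, "I": 2.0},
}
P_single, P_pair = estimate_frequencies(A)
print(P_single)                               # P(L) = 2/3, P(I) = 1/3
print(log_odds(P_single, P_pair, "L", "L"))   # positive: seen more often than chance
print(log_odds(P_single, P_pair, "L", "I"))   # negative: seen less often than chance
```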

31
Gaps are not inserted uniformly
32
A state model for alignment
States: M (1, 1), I (1, 0), J (0, 1)
Alignments correspond 1-to-1 with sequences of
states M, I, J

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC
IMMJMMMMMMMJJMMMMMMJMMMMMMMIIMMMMMIII
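
The 1-to-1 correspondence can be sketched in both directions. The labeling below is chosen to reproduce the state string printed on this slide: a column with a gap in the first row is labeled I and a column with a gap in the second row is labeled J, which matches I = (1, 0) if the second row is taken as x (an assumption about the figure). The function names are illustrative.

```python
def state_path(row_first, row_second, gap="-"):
    """Alignment rows -> string of states M, I, J (one state per column)."""
    states = []
    for a, b in zip(row_first, row_second):
        if a == gap:
            states.append("I")  # gap in the first row
        elif b == gap:
            states.append("J")  # gap in the second row
        else:
            states.append("M")  # letter aligned to letter
    return "".join(states)


def alignment_from_path(path, seq_first, seq_second, gap="-"):
    """State string plus the two ungapped sequences -> alignment rows."""
    first, second = [], []
    p = q = 0  # next unread position in seq_first / seq_second
    for state in path:
        if state == "M":
            first.append(seq_first[p]); second.append(seq_second[q]); p += 1; q += 1
        elif state == "I":
            first.append(gap); second.append(seq_second[q]); q += 1
        else:  # "J"
            first.append(seq_first[p]); second.append(gap); p += 1
    return "".join(first), "".join(second)


row_first = "-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---"
row_second = "TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC"
path = state_path(row_first, row_second)
print(path)  # IMMJMMMMMMMJJMMMMMMJMMMMMMMIIMMMMMIII
```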
33
Let's score the transitions
States: M (1, 1), I (1, 0), J (0, 1)
Transition scores:
  • into M (from M, I, or J): $s(x_i, y_j)$
  • M → I and M → J (gap open): $-d$
  • I → I and J → J (gap extension): $-e$
Alignments correspond 1-to-1 with sequences of
states M, I, J

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC
IMMJMMMMMMMJJMMMMMMJMMMMMMMIIMMMMMIII
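
Scoring transitions rather than columns gives an affine gap cost: the first column of a gap pays $d$ and each further column of the same gap pays $e$. The sketch below scores a given alignment this way. The parameter values are hypothetical, the substitution score is a simple match/mismatch rule, and a gap state entered from any different state (including the start) is charged $d$, which is an assumption where the diagram shows no arrow.

```python
def affine_path_score(row_first, row_second, m=1, s=1, d=2, e=1, gap="-"):
    """Sum of transition scores along the state path of an alignment."""
    score = 0
    prev = None
    for a, b in zip(row_first, row_second):
        if a == gap:
            state = "I"
        elif b == gap:
            state = "J"
        else:
            state = "M"
        if state == "M":
            score += m if a == b else -s  # s(x_i, y_j) on entering M
        elif state == prev:
            score -= e                    # staying in I or J: gap extension
        else:
            score -= d                    # entering I or J: gap open
        prev = state
    return score


# One gap of length 2: three matches, one open, one extension.
print(affine_path_score("A--TA", "AGGTA"))            # 3 - 2 - 1 = 0
# With d == e the score reduces to the linear gap model.
print(affine_path_score("A--TA", "AGGTA", d=1, e=1))  # 3 - 1 - 1 = 1
```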
34
A probabilistic model for alignment
  • Assign probabilities to every transition (arrow),
    and emission (pair of letters or gaps)
  • Probabilities of mutation reflect amino acid
    similarities
  • Different probabilities for opening and
    extending gap

States: M (1, 1), I (1, 0), J (0, 1)

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC
IMMJMMMMMMMJJMMMMMMJMMMMMMMIIMMMMMIII
35
A Pair HMM for alignments
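Transition scores:
  • M → M: $\log(1 - 2\delta)$
  • M → I and M → J: $\log \delta$
  • I → I and J → J: $\log \varepsilon$
  • I → M and J → M: $\log(1 - \varepsilon)$
Emission scores:
  • M emits the pair $(x_i, y_j)$: $\log \mathrm{Prob}(x_i, y_j)$
  • I emits $x_i$ with probability $P(x_i)$
  • J emits $y_j$ with probability $P(y_j)$
Highest scoring path corresponds to the most
likely alignment!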
36
How do we find the highest scoring path?
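  • Compute the following matrices (DP)
  • $M(i, j)$: log-probability of the most likely alignment
    of $x_1 \ldots x_i$ with $y_1 \ldots y_j$ ending in state M
  • $I(i, j)$: log-probability of the most likely alignment
    of $x_1 \ldots x_i$ with $y_1 \ldots y_j$ ending in state I
  • $J(i, j)$: log-probability of the most likely alignment
    of $x_1 \ldots x_i$ with $y_1 \ldots y_j$ ending in state J
  • $M(i, j) = \log \mathrm{Prob}(x_i, y_j) + \max \{ M(i-1, j-1) + \log(1 - 2\delta),\;
    I(i-1, j-1) + \log(1 - \varepsilon),\;
    J(i-1, j-1) + \log(1 - \varepsilon) \}$
  • $I(i, j) = \max \{ M(i-1, j) + \log \delta,\;
    I(i-1, j) + \log \varepsilon \}$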

37
The Viterbi algorithm for alignment
  • For each $i = 1, \ldots, M$
  • For each $j = 1, \ldots, N$
  • $M(i, j) = \log \mathrm{Prob}(x_i, y_j) + \max \{ M(i-1, j-1) + \log(1 - 2\delta),\;
    I(i-1, j-1) + \log(1 - \varepsilon),\;
    J(i-1, j-1) + \log(1 - \varepsilon) \}$
  • $I(i, j) = \max \{ M(i-1, j) + \log \delta,\;
    I(i-1, j) + \log \varepsilon \}$
  • $J(i, j) = \max \{ M(i, j-1) + \log \delta,\;
    J(i, j-1) + \log \varepsilon \}$
  • When the matrices are filled, we can trace back from
    $(M, N)$ the likeliest alignment
38
One way to view the state paths: State M
(Figure: grid over y1 … yn and x1 … xm)
39
State I

(Figure: grid over y1 … yn and x1 … xm)
40
State J

(Figure: grid over y1 … yn and x1 … xm)
41
Putting it all together
States I(i, j) are connected with states J and M at (i-1, j).
States J(i, j) are connected with states I and M at (i, j-1).
States M(i, j) are connected with states J and I at (i-1, j-1).
(Figure: grid over y1 … yn and x1 … xm)
42
Putting it all together
States I(i, j) are connected with states J and M at (i-1, j).
States J(i, j) are connected with states I and M at (i, j-1).
States M(i, j) are connected with states J and I at (i-1, j-1).
The optimal solution is the best scoring path from the top-left
to the bottom-right corner.
This gives the likeliest alignment according to our HMM.
(Figure: grid over y1 … yn and x1 … xm)
43
Yet another way to represent this model
(Figure: states BEGIN, Mx1 … Mxm along sequence X, insert states Ix and Iy, END)
We are aligning, or threading, sequence Y through
sequence X.
Every time yj lands in state xi, we get
substitution score s(xi, yj).
Every time yj is gapped, or some xi is skipped,
we pay a gap penalty.