Title: Profiles and Hidden Markov Models
1Chapter 6 Profiles and Hidden Markov Models
2- The following approaches can also be used to
identify distantly related members to a family of
protein (or DNA) sequences - Position-specific scoring matrix (PSSM)
- Profile
- Hidden Markov Model
- These methods work by providing a statistical
frame where the probability of residues or
nucleotides at specific sequences are tested - Thus, in multiple alignments, information on all
the members in the alignment is retained.
3Position-specific scoring matrices
Position 1 2 3 4 5 6 Sequence 1 A T G T C
G Sequence 2 A A G A C T Sequence 3 T A C T C
A Sequence 4 C G G A G G Sequence 5 A A C C T G
Frequencies of observations in a position
Converted to log2
Normalised to overall frequencies
4Match AACTCG to the PSSM matrix
- 1.01.00.81.01.381.15 6.33
- 26.33 80
- Thus, the sequence AACTCG is 80 times more likely
to fit than a random 6 nucleotide sequence
5Profiles
- Profiles are PSSMs that include gap penalty
information - This is not a trivial problem, and is
incorporated in Position specific iterated (PSI)
BLAST - A normal BLASTP is performed with the query
sequence, homologs obtained, and a multiple
alignment performed - A Profile is based on this alignment
- The profile is used to search the database again,
and a new profile is created by adding in newly
identified homologs - This process is repeated until no new homologs
are identified - PSI-BLAST very sensitive approach to search for
distant relatives of a family - High sensitivity can generate high false positive
count - Inclusion of false positives can lead to profile
drift - User can visually inspect each iteration result
to decide on inclusion of sequences - Typically 3-5 iterations sufficient to identiofy
distant homologs
6Markov Model and Hidden Markov Model
1
2
3
4
5
P12
P23
P34
P45
- A Markov chain described a series of events or
states - There is a certain probability to move from one
state to the next state - This is known as the transition probability
- Sequences can also be seen as Markov chains where
the occurrence of a given nucleotide may depend
on the preceding nucleotide - Zero order Markov model described a state that is
independent of a previous state - First order Markov model state is dependent on
direct precursor (i.e., di-nucleotide sequences) - Second order Markov model, depends on three
nucleotides, for example codons - Thus frequency of transitions in tri-mers may be
different in coding and non-coding regions of the
genome - The Markov model is therefore applicable to
finding genes in genomes - In a Markov model all states are observable
7Hidden Markov model
Observable states
P23
P34
2
3
4
P12
P45
1
5
Begin state
End state
P12
P45
2
3
4
Hidden states
P23
P34
A Markov model may consist of observable states
and unobservable or hidden states The hidden
states also affect the outcome of the observed
states In a sequence alignment, a gap is an
unobserved state that influences to probability
of the next nucleotide The probability of going
from one state to the next state is called the
transition probability In DNA, there are four
symbols or states G, A, T and C (20 in
proteins) The probability value associated with
each symbol is the emission probability To
calculate the probability of a particular path,
the transition and emission probabilities of all
possible paths have to be considered
8A simple two state example
Emission probability
Transition probability
0.40
State 1
State 2
This particular Markov model has a probability of
0.80 X 0.40 X 0.32 0.102to generate the
sequence AG This particular model shows that the
sequence AT has the highest probability to
occur Where do these numbers come from? A Markov
model has to be trained with examples
9Training
- The frequencies of occurrence of nucleotides in a
multiply aligned sequence is used to calculate
the emission and transition probabilities of each
symbol at each state - The trained HMM is then used to test how well a
new sequence fits to the model - The use a HMM for gaps sequence alignments, a
state can either be a - match/mismatch (mismatch is low probability
match) (observable) - Insertion (hidden)
- Deletion (hidden)
I1
I2
I3
I4
B
M1
M2
M3
E
D1
D2
D3
There is one optimal path from B to E that
describes the most probable sequence and the
optimal alignment to the multiply aligned
sequence family
10Viterbi algorithm
11A brief interlude, looking at algorithms
Bubblesort
0
- The algorithm
- Two loops
- The outer loop starts at index max-1 and
decrements by -1 with every loop - The inner loop starts at 0 and increments by 1
to the value of the outer loop - Compare values at index and at index1 in the
inner loop - If valueindexltvalueindex1, swap them
- Continue until outer loop is 1
max
12Outer loop 9 Inner loop 0
Outer loop 9 Inner loop 0
Outer loop 9 Inner loop 0
Outer loop 9 Inner loop 0
Outer loop 9 Inner loop 0
Outerloop9
Outer loop 9 Inner loop 0
Outer loop 9 Inner loop 0
Outer loop 9 Inner loop 0
Outer loop 9 Inner loop 0
Outer loop 9 Inner loop 0
Smallest number is now at the bottom
13Outer loop 9 Inner loop 0
Outer loop 9 Inner loop 0
Outer loop 9 Inner loop 0
Outer loop 9 Inner loop 0
Outer loop 9 Inner loop 0
Outerloop9
Outer loop 9 Inner loop 0
Outer loop 9 Inner loop 0
Outer loop 9 Inner loop 0
Outer loop 9 Inner loop 0
Next smallest number is now at the bottom-1
14Python code for Bubblesort algorithm
import random def bubblesort(list_of_numbers)
for outer_loop in range(len(list_of_numbers)-1,
0, -1) for index in range(outer_loop)
if list_of_numbersindex lt
list_of_numbersindex 1
temporary list_of_numbersindex
list_of_numbersindex list_of_numbersindex
1 list_of_numbersindex 1
temporary return list_of_numbers numbersrang
e(10) get a list of numbers from 0 to
9 random.shuffle(numbers) shuffle the
numbers print "In random order ", numbers print
"In order ", bubblesort(numbers)
15Quicksort
def qsort2(L) if len(L)lt1 return
L pivotL0 less x for x in L if
xltpivot equal x for x in L if xpivot
greater x for x in L if xgtpivot return
qsort2(less)equalqsort2(greater)
16Applications of HMMs
- HMMs include predictive information of insertions
and deletions separately - Not arbitrary gap penalties
- Once HMMs are trained, can be used to identify
distant family members in a database - Can be used for protein family classification
- Advanced gene and promoter prediction
- Transmembrane protein prediction
- Protein fold recognition
- Nucleosome positions
- HMMer (http//hmmer.wustl.edu/) suite of linux
programs - hmmalign, aligns sequences to an HMM profile.
- hmmbuild, build a hidden Markov model from an
alignment. - hmmcalibrate, calibrate HMM search statistics.
- hmmconvert, convert between profile HMM file
formats. - hmmemit, generate sequences from a profile HMM.
- hmmfetch, retrieve specific HMM from an HMM
database. - hmmindex, create SSI index for an HMM database.
- hmmpfam, search one or more sequences against HMM
database. - hmmsearch, search a sequence database with a
profile HMM.
17Chapter 7 Protein Motif and Domain Prediction
18- A motif is a conserved sequence 10-230 aa long
- Eg. Zn-finger motif
- Domain is 40-700 aa in length
- Eg. transmembrane domain
- Motifs and domains are often evolutionally
conserved - Useful to identify functions of proteins that
should little homology over full sequence - Motifs and domains often identified by PSSM and
HMMs - Motifs or domains can be stored in a database
- Unknown proteins can be matched to this database
to identify motifs and domains and illuminate
possible protein fundctions - Motifs domains can be stored as
- regular expression (ST-X-RK)
- Or as PSSM or HMMs
19Regular expressions
- E-X(2)-FHM-X(4)-P-L
- Invariant
- Conserved in square brackets
- Disallowed in curly brackets
- Nonspecific shown by X
- Repetions by number in round () brackets
- PROSITE (http//expasy.org/prosite/)
- High number of false negatives
- Database must be continually updated
- PSSM, profiles and HMMs incorporate statistical
information and are much more accurate
20(No Transcript)
21PRINTS Matches smaller regions of a motifs called
fingerprints to query http//www.bioinf.manchest
er.ac.uk/dbbrowser/ BLOCKS PSSM or aligned
sequences used to define blocks that are larger
than motifs http//blocks.fhcrc.org/blocks/ ProDo
m Database generated with PSI-BLAST http//prodom.
prabi.fr/prodom/current/html/home.php Pfam Contai
ns HMMs of seeded smaller alignment from
SWISSPROT and trEMBL http//pfam.sanger.ac.uk/ SM
ART Database of HMMs based on manual structural
alignments or PSI-BLAST profiles http//smart.embl
-heidelberg.de/
22Protein family databases
COG (Cluster of orthologous groups) All against
all comparison of all sequenced genomes If best
fit is obtained in prokaryotes, archeae and
eukaryotes, defined as cluster Clusters can be
searched to identify possible function of unknown
protein http//www.ncbi.nlm.nih.gov/COG/ ProtoNe
t Pairwise BLAST alignment of all protein
sequences in SWISSPROT Query sequence searched
against this database http//www.protonet.cs.huji.
ac.il/
23Finding distant/little conserved motifs
Expectation Maximization Use predicted alignment
of sequences Calculate PSSM Iterate over used
sequences and modify PSSM to better fit each in
turn Gibbs Motif sampling Use estimated
alignment of all but one sequence Calculate
PSSM Recalculate PSSM with one left-out
sequence Iterate process to convergence setting
24Weblogo
Graphical representation of the motif
sequence Highly conserved residues are shown as
larger symbols Ambiguity indicated http//weblogo
.berkeley.edu/logo.cgi Helix-turn-helix motif of
E. coli CAP family protein