Profiles and Hidden Markov Models - PowerPoint PPT Presentation

1 / 24

About This Presentation

Title:

Profiles and Hidden Markov Models

Description:

... methods work by providing a statistical frame where the probability of residues ... In DNA, there are four symbols or states: G, A, T and C (20 in proteins) ... – PowerPoint PPT presentation

Number of Views:125

Avg rating:3.0/5.0

Slides: 25

Provided by: patt86

Category:

more less

Transcript and Presenter's Notes

Title: Profiles and Hidden Markov Models

1
Chapter 6 Profiles and Hidden Markov Models
2

The following approaches can also be used to
identify distantly related members to a family of
protein (or DNA) sequences
Position-specific scoring matrix (PSSM)
Profile
Hidden Markov Model
These methods work by providing a statistical
frame where the probability of residues or
nucleotides at specific sequences are tested
Thus, in multiple alignments, information on all
the members in the alignment is retained.

3
Position-specific scoring matrices
Position 1 2 3 4 5 6 Sequence 1 A T G T C
G Sequence 2 A A G A C T Sequence 3 T A C T C
A Sequence 4 C G G A G G Sequence 5 A A C C T G
Frequencies of observations in a position
Converted to log2
Normalised to overall frequencies
4
Match AACTCG to the PSSM matrix

1.01.00.81.01.381.15 6.33
26.33 80
Thus, the sequence AACTCG is 80 times more likely
to fit than a random 6 nucleotide sequence

5
Profiles

Profiles are PSSMs that include gap penalty
information
This is not a trivial problem, and is
incorporated in Position specific iterated (PSI)
BLAST
A normal BLASTP is performed with the query
sequence, homologs obtained, and a multiple
alignment performed
A Profile is based on this alignment
The profile is used to search the database again,
and a new profile is created by adding in newly
identified homologs
This process is repeated until no new homologs
are identified
PSI-BLAST very sensitive approach to search for
distant relatives of a family
High sensitivity can generate high false positive
count
Inclusion of false positives can lead to profile
drift
User can visually inspect each iteration result
to decide on inclusion of sequences
Typically 3-5 iterations sufficient to identiofy
distant homologs

6
Markov Model and Hidden Markov Model
1
2
3
4
5
P12
P23
P34
P45

A Markov chain described a series of events or
states
There is a certain probability to move from one
state to the next state
This is known as the transition probability
Sequences can also be seen as Markov chains where
the occurrence of a given nucleotide may depend
on the preceding nucleotide
Zero order Markov model described a state that is
independent of a previous state
First order Markov model state is dependent on
direct precursor (i.e., di-nucleotide sequences)
Second order Markov model, depends on three
nucleotides, for example codons
Thus frequency of transitions in tri-mers may be
different in coding and non-coding regions of the
genome
The Markov model is therefore applicable to
finding genes in genomes
In a Markov model all states are observable

7
Hidden Markov model
Observable states
P23
P34
2
3
4
P12
P45
1
5
Begin state
End state
P12
P45
2
3
4
Hidden states
P23
P34
A Markov model may consist of observable states
and unobservable or hidden states The hidden
states also affect the outcome of the observed
states In a sequence alignment, a gap is an
unobserved state that influences to probability
of the next nucleotide The probability of going
from one state to the next state is called the
transition probability In DNA, there are four
symbols or states G, A, T and C (20 in
proteins) The probability value associated with
each symbol is the emission probability To
calculate the probability of a particular path,
the transition and emission probabilities of all
possible paths have to be considered
8
A simple two state example
Emission probability
Transition probability
0.40
State 1
State 2
This particular Markov model has a probability of
0.80 X 0.40 X 0.32 0.102to generate the
sequence AG This particular model shows that the
sequence AT has the highest probability to
occur Where do these numbers come from? A Markov
model has to be trained with examples
9
Training

The frequencies of occurrence of nucleotides in a
multiply aligned sequence is used to calculate
the emission and transition probabilities of each
symbol at each state
The trained HMM is then used to test how well a
new sequence fits to the model
The use a HMM for gaps sequence alignments, a
state can either be a
match/mismatch (mismatch is low probability
match) (observable)
Insertion (hidden)
Deletion (hidden)

I1
I2
I3
I4
B
M1
M2
M3
E
D1
D2
D3
There is one optimal path from B to E that
describes the most probable sequence and the
optimal alignment to the multiply aligned
sequence family
10
Viterbi algorithm
11
A brief interlude, looking at algorithms
Bubblesort
0

The algorithm
Two loops
The outer loop starts at index max-1 and
decrements by -1 with every loop
The inner loop starts at 0 and increments by 1
to the value of the outer loop
Compare values at index and at index1 in the
inner loop
If valueindexltvalueindex1, swap them
Continue until outer loop is 1

max
12
Outer loop 9 Inner loop 0
Outer loop 9 Inner loop 0
Outer loop 9 Inner loop 0
Outer loop 9 Inner loop 0
Outer loop 9 Inner loop 0
Outerloop9
Outer loop 9 Inner loop 0
Outer loop 9 Inner loop 0
Outer loop 9 Inner loop 0
Outer loop 9 Inner loop 0
Outer loop 9 Inner loop 0
Smallest number is now at the bottom
13
Outer loop 9 Inner loop 0
Outer loop 9 Inner loop 0
Outer loop 9 Inner loop 0
Outer loop 9 Inner loop 0
Outer loop 9 Inner loop 0
Outerloop9
Outer loop 9 Inner loop 0
Outer loop 9 Inner loop 0
Outer loop 9 Inner loop 0
Outer loop 9 Inner loop 0
Next smallest number is now at the bottom-1
14
Python code for Bubblesort algorithm
import random def bubblesort(list_of_numbers)
for outer_loop in range(len(list_of_numbers)-1,
0, -1) for index in range(outer_loop)
if list_of_numbersindex lt
list_of_numbersindex 1
temporary list_of_numbersindex
list_of_numbersindex list_of_numbersindex
1 list_of_numbersindex 1
temporary return list_of_numbers numbersrang
e(10) get a list of numbers from 0 to
9 random.shuffle(numbers) shuffle the
numbers print "In random order ", numbers print
"In order ", bubblesort(numbers)
15
Quicksort
def qsort2(L) if len(L)lt1 return
L pivotL0 less x for x in L if
xltpivot equal x for x in L if xpivot
greater x for x in L if xgtpivot return
qsort2(less)equalqsort2(greater)
16
Applications of HMMs

HMMs include predictive information of insertions
and deletions separately
Not arbitrary gap penalties
Once HMMs are trained, can be used to identify
distant family members in a database
Can be used for protein family classification
Advanced gene and promoter prediction
Transmembrane protein prediction
Protein fold recognition
Nucleosome positions
HMMer (http//hmmer.wustl.edu/) suite of linux
programs
hmmalign, aligns sequences to an HMM profile.
hmmbuild, build a hidden Markov model from an
alignment.
hmmcalibrate, calibrate HMM search statistics.
hmmconvert, convert between profile HMM file
formats.
hmmemit, generate sequences from a profile HMM.
hmmfetch, retrieve specific HMM from an HMM
database.
hmmindex, create SSI index for an HMM database.
hmmpfam, search one or more sequences against HMM
database.
hmmsearch, search a sequence database with a
profile HMM.

17
Chapter 7 Protein Motif and Domain Prediction
18

A motif is a conserved sequence 10-230 aa long
Eg. Zn-finger motif
Domain is 40-700 aa in length
Eg. transmembrane domain
Motifs and domains are often evolutionally
conserved
Useful to identify functions of proteins that
should little homology over full sequence
Motifs and domains often identified by PSSM and
HMMs
Motifs or domains can be stored in a database
Unknown proteins can be matched to this database
to identify motifs and domains and illuminate
possible protein fundctions
Motifs domains can be stored as
regular expression (ST-X-RK)
Or as PSSM or HMMs

19
Regular expressions

E-X(2)-FHM-X(4)-P-L
Invariant
Conserved in square brackets
Disallowed in curly brackets
Nonspecific shown by X
Repetions by number in round () brackets
PROSITE (http//expasy.org/prosite/)
High number of false negatives
Database must be continually updated
PSSM, profiles and HMMs incorporate statistical
information and are much more accurate

20
(No Transcript)
21
PRINTS Matches smaller regions of a motifs called
fingerprints to query http//www.bioinf.manchest
er.ac.uk/dbbrowser/ BLOCKS PSSM or aligned
sequences used to define blocks that are larger
than motifs http//blocks.fhcrc.org/blocks/ ProDo
m Database generated with PSI-BLAST http//prodom.
prabi.fr/prodom/current/html/home.php Pfam Contai
ns HMMs of seeded smaller alignment from
SWISSPROT and trEMBL http//pfam.sanger.ac.uk/ SM
ART Database of HMMs based on manual structural
alignments or PSI-BLAST profiles http//smart.embl
-heidelberg.de/
22
Protein family databases
COG (Cluster of orthologous groups) All against
all comparison of all sequenced genomes If best
fit is obtained in prokaryotes, archeae and
eukaryotes, defined as cluster Clusters can be
searched to identify possible function of unknown
protein http//www.ncbi.nlm.nih.gov/COG/ ProtoNe
t Pairwise BLAST alignment of all protein
sequences in SWISSPROT Query sequence searched
against this database http//www.protonet.cs.huji.
ac.il/
23
Finding distant/little conserved motifs
Expectation Maximization Use predicted alignment
of sequences Calculate PSSM Iterate over used
sequences and modify PSSM to better fit each in
turn Gibbs Motif sampling Use estimated
alignment of all but one sequence Calculate
PSSM Recalculate PSSM with one left-out
sequence Iterate process to convergence setting
24
Weblogo
Graphical representation of the motif
sequence Highly conserved residues are shown as
larger symbols Ambiguity indicated http//weblogo
.berkeley.edu/logo.cgi Helix-turn-helix motif of
E. coli CAP family protein

Write a Comment

User Comments (0)