Title: Profile Hidden Markov Models
1. Profile Hidden Markov Models
- Bioinformatics Fall-2004
- Dr Webb Miller and Dr Claude Depamphilis
- Dhiraj Joshi
- Department of Computer Science and Engineering
- The Pennsylvania State University
2. Outline
- Introduction to HMMs
- Profile HMMs
- Available resources for Profile HMMs
- Some online demonstrations
3. Introduction to HMMs
- Hidden Markov Models: Formalism
- statistical techniques for modeling patterns in data
- first-order Markov property: memorylessness
- a state is generally a hidden entity which emits symbols or features
- the same symbol could be emitted by several states
- an HMM is characterized by its transition probabilities and emission distributions
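The formalism above can be made concrete with a toy example. The sketch below is only an illustration: the two states (`at_rich`, `gc_rich`), the DNA alphabet, and every probability are assumptions chosen for readability and are not taken from the slides. It computes the joint probability of a state path and a symbol sequence, which is simply a product of transition and emission terms under the first-order Markov property.

```python
# Toy two-state HMM over DNA. All names and numbers are illustrative assumptions.
INITIAL = {"at_rich": 0.5, "gc_rich": 0.5}
TRANSITION = {
    "at_rich": {"at_rich": 0.9, "gc_rich": 0.1},
    "gc_rich": {"at_rich": 0.2, "gc_rich": 0.8},
}
# Note that every symbol can be emitted by both states, only with different odds.
EMISSION = {
    "at_rich": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "gc_rich": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}


def joint_probability(states, symbols):
    """P(state path, symbol sequence) under the first-order Markov assumption."""
    prob = INITIAL[states[0]] * EMISSION[states[0]][symbols[0]]
    for prev, curr, sym in zip(states, states[1:], symbols[1:]):
        # The next state depends only on the previous state (memorylessness).
        prob *= TRANSITION[prev][curr] * EMISSION[curr][sym]
    return prob


p = joint_probability(["at_rich", "at_rich", "gc_rich"], "ATG")
print(p)
```

In practice the state path is hidden, so only the symbols are observed; the algorithms on the next slide deal with that.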
4. Introduction to HMMs
- Hidden Markov Models: Parameter Estimation
- parameters: transition probabilities and emission probabilities
- iterative computational algorithms used
- EM algorithm, Viterbi algorithm
- algorithms based on dynamic programming to save computational cost
- usually the iterations involve variants of the following two steps
- estimate the state sequence which maximizes the likelihood under a parameter set
- update the parameter set based on the estimated state sequence
- the algorithms may converge to a local optimum rather than the global one
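The two steps listed above can be sketched as follows. This is a minimal illustration of one round of Viterbi-style training on a toy two-state DNA model, not the procedure of any particular package: the state names, the probabilities, the `reestimate_emissions` helper, and the add-one pseudocount are all assumptions made for the example.

```python
import math
from collections import Counter

# Toy parameters. Every name and number here is an illustrative assumption.
STATES = ["at_rich", "gc_rich"]
ALPHABET = "ACGT"
INITIAL = {"at_rich": 0.5, "gc_rich": 0.5}
TRANSITION = {
    "at_rich": {"at_rich": 0.9, "gc_rich": 0.1},
    "gc_rich": {"at_rich": 0.2, "gc_rich": 0.8},
}
EMISSION = {
    "at_rich": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "gc_rich": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}


def viterbi(symbols, states, initial, transition, emission):
    """Step 1: most likely state path, by dynamic programming in log space."""
    score = {s: math.log(initial[s]) + math.log(emission[s][symbols[0]])
             for s in states}
    backpointers = []
    for sym in symbols[1:]:
        new_score, pointers = {}, {}
        for s in states:
            # Best predecessor of state s at this position.
            best = max(states, key=lambda p: score[p] + math.log(transition[p][s]))
            pointers[s] = best
            new_score[s] = (score[best] + math.log(transition[best][s])
                            + math.log(emission[s][sym]))
        backpointers.append(pointers)
        score = new_score
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


def reestimate_emissions(symbols, path, states, alphabet, pseudocount=1.0):
    """Step 2: update emission probabilities from counts along the decoded path."""
    counts = {s: Counter() for s in states}
    for state, sym in zip(path, symbols):
        counts[state][sym] += 1
    updated = {}
    for s in states:
        total = sum(counts[s].values()) + pseudocount * len(alphabet)
        updated[s] = {a: (counts[s][a] + pseudocount) / total for a in alphabet}
    return updated


observed = "AAAAGGGG"
decoded = viterbi(observed, STATES, INITIAL, TRANSITION, EMISSION)
updated = reestimate_emissions(observed, decoded, STATES, ALPHABET)
print(decoded)
print(updated["gc_rich"])
```

Repeating the two steps until the decoded path stops changing gives the iterative scheme; a full EM (Baum-Welch) variant would replace the single best path with expected counts over all paths.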
6. Profile Hidden Markov Models
- Stochastic methods to model multiple sequence alignments of protein and DNA sequences
- Potential application domains
- protein families could be modeled as an HMM or a group of HMMs
- constructing a profile HMM
- new protein sequences could be aligned with stored models to detect remote homology
- aligning a sequence with a stored profile HMM
- two or more protein family profile HMMs could be aligned to detect homology
- finding statistical similarities between two profile HMMs
7. Profile Hidden Markov Models
- Constructing a profile HMM
- a multiple sequence alignment is assumed
- each consensus column can exist in 3 states
- match, insert and delete states
- number of states depends upon the length of the alignment
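As a rough illustration of how the number of states follows from the alignment, the sketch below marks consensus (match) columns and counts the states. The rule used here, that a column is a consensus column when fewer than half of its entries are gaps, is a common heuristic adopted as an assumption, and the tally of L match, L delete and L + 1 insert states (begin and end states not counted) is the usual textbook layout rather than something stated on the slide. The alignment itself is made up.

```python
def match_columns(alignment, max_gap_fraction=0.5):
    """Indices of columns treated as consensus (match) columns.

    Assumed heuristic: a column is a match column when the fraction of
    gap characters in it is below `max_gap_fraction`.
    """
    n_seqs = len(alignment)
    columns = []
    for j in range(len(alignment[0])):
        gaps = sum(1 for seq in alignment if seq[j] == "-")
        if gaps / n_seqs < max_gap_fraction:
            columns.append(j)
    return columns


def state_counts(n_match):
    """State tally for a profile HMM with `n_match` consensus columns.

    Assumed layout: one match and one delete state per consensus column,
    plus one insert state before, between, and after them.
    """
    return {"match": n_match, "insert": n_match + 1, "delete": n_match}


# Made-up toy alignment; the third column is mostly gaps.
alignment = ["AC-GT", "AC-GT", "A--GT", "ACTGT"]
cols = match_columns(alignment)
print(cols)
print(state_counts(len(cols)))
```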
8. Profile Hidden Markov Models
- A typical profile HMM architecture
- squares represent match states
- diamonds represent insert states
- circles represent delete states
- arrows represent transitions
9. Profile Hidden Markov Models
- A typical profile HMM architecture
- transition between match states
- transition from match state to insert state
- transition within insert state
- transition from match state to delete state
- transition between delete states
- emission of symbol at a state
10. Profile Hidden Markov Models
- Estimation of parameters
- transition probabilities estimated as the frequency of a transition in a given alignment
- emission probabilities estimated as the frequency of an emission in a given alignment
- pseudo counts usually introduced to account for transitions / emissions which were not present in the alignment
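The frequency-based estimate of transition probabilities can be sketched as below. This is a simplified illustration under several assumptions: the toy alignment and its match columns are made up, the begin and end states are written as M_0 and M_(L+1), each state at position k is allowed to move to M_(k+1), I_k or D_(k+1), and a flat add-one pseudocount stands in for whatever prior a real tool would use.

```python
from collections import Counter

# Made-up toy alignment and its consensus columns (assumptions for illustration).
ALIGNMENT = ["AC-GT", "AC-GT", "A--GT", "ACTGT"]
MATCH_COLS = {0, 1, 3, 4}


def state_path(seq, match_cols):
    """State path of one aligned sequence as (kind, position) pairs."""
    path = [("M", 0)]  # begin state, written as M_0
    k = 0              # number of consensus columns passed so far
    for j, ch in enumerate(seq):
        if j in match_cols:
            k += 1
            # Residue in a consensus column: match. Gap: delete.
            path.append(("M", k) if ch != "-" else ("D", k))
        elif ch != "-":
            # Residue in a non-consensus column: insert after position k.
            path.append(("I", k))
    path.append(("M", k + 1))  # end state
    return path


def transition_probabilities(alignment, match_cols, pseudocount=1.0):
    """Transition frequencies with a flat pseudocount, for observed source states."""
    counts = Counter()
    for seq in alignment:
        path = state_path(seq, match_cols)
        counts.update(zip(path, path[1:]))
    n_match = len(match_cols)
    probs = {}
    for src in {source for source, _ in counts}:
        _, k = src
        targets = [("M", k + 1), ("I", k)]
        if k < n_match:
            targets.append(("D", k + 1))  # no delete state past the last column
        total = sum(counts[(src, t)] for t in targets) + pseudocount * len(targets)
        probs[src] = {t: (counts[(src, t)] + pseudocount) / total for t in targets}
    return probs


probs = transition_probabilities(ALIGNMENT, MATCH_COLS)
print(probs[("M", 1)])
```

Without the pseudocount, the transition from M_1 to I_1 would get probability zero simply because no sequence in this small alignment happens to use it.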
11. Profile Hidden Markov Models
- Estimation of parameters
- with pseudo counts
- Dirichlet prior distribution used to determine pseudo counts
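One way to read the Dirichlet bullet is sketched below for a single match column: with a single Dirichlet prior whose parameters are alpha_a, the posterior mean estimate of the emission probability is (c_a + alpha_a) / (sum of counts + sum of alphas), so the alphas act as pseudo counts. The DNA alphabet, the uniform background, and the prior strength of 2 are assumptions for the example; real tools commonly use mixtures of Dirichlet priors over amino acids, which this sketch does not attempt.

```python
from collections import Counter


def emission_probabilities(column, alphabet, alpha):
    """Posterior mean emission estimate for one column under a Dirichlet prior.

    `alpha` maps each symbol to its Dirichlet parameter, used as a pseudo count.
    Gap characters in the column are ignored.
    """
    counts = Counter(ch for ch in column if ch != "-")
    total = sum(counts[a] for a in alphabet) + sum(alpha[a] for a in alphabet)
    return {a: (counts[a] + alpha[a]) / total for a in alphabet}


ALPHABET = "ACGT"
background = {a: 0.25 for a in ALPHABET}      # assumed uniform background
prior_strength = 2.0                          # assumed total weight of the prior
alpha = {a: prior_strength * background[a] for a in ALPHABET}

column = "AAA-C"                              # made-up alignment column
emissions = emission_probabilities(column, ALPHABET, alpha)
print(emissions)
```

Symbols that never appear in the column (here G and T) still receive a small non-zero probability, and the influence of the prior shrinks as the number of observed sequences grows.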
12. Profile Hidden Markov Models
- Scoring a sequence against a profile HMM
- Viterbi algorithm used to find the best state path
- simulated annealing based methods also used
- maximization criterion: log likelihood or log odds
- log likelihood score generally depends on the length of the sequence and hence is not preferred
- if an alignment is not given initially, the alignment could be learnt iteratively using Viterbi
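The length dependence of the log-likelihood score can be seen in a deliberately simplified setting. The sketch below scores a sequence against match-state emissions only, with no insert or delete states and no transition terms, so it is not a Viterbi implementation; the column distribution, the uniform background, and the function names are assumptions for illustration.

```python
import math


def log_likelihood(seq, match_emissions):
    """Sum of log emission probabilities along an ungapped match-state path."""
    return sum(math.log(match_emissions[i][ch]) for i, ch in enumerate(seq))


def log_odds(seq, match_emissions, background):
    """Same path, but each emission is scored relative to a background model."""
    return sum(math.log(match_emissions[i][ch] / background[ch])
               for i, ch in enumerate(seq))


# Assumed toy model: every column strongly prefers "A".
column = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
profile_short = [column] * 3
profile_long = [column] * 6

ll_short = log_likelihood("AAA", profile_short)
ll_long = log_likelihood("AAAAAA", profile_long)
lo_short = log_odds("AAA", profile_short, background)
lo_long = log_odds("AAAAAA", profile_long, background)
print(ll_short, ll_long)
print(lo_short, lo_long)
```

Both sequences match their profile perfectly, yet the longer one gets a lower log likelihood simply because more probabilities below 1 are multiplied together. The log-odds score instead grows with each well-matched position, which is why it is the more useful criterion for comparing sequences of different lengths.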
13. Profile Hidden Markov Models
- Comparing two profile HMMs
- profile-profile comparison tool based on information theory
- based on the Kullback-Leibler divergence criterion for comparing 2 statistical distributions
- dynamic programming used to compare entire profiles
- detects weak similarities between models
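For a single pair of profile columns, the Kullback-Leibler criterion can be sketched as below. This only compares two emission distributions; the dynamic programming step that aligns entire profiles is not shown. The example distributions and the `symmetric_kl` helper (a plain average of the two directions) are assumptions for illustration, and the code assumes all probabilities in the second argument are non-zero, which pseudo counts would normally guarantee.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits.

    Assumes q[a] > 0 wherever p[a] > 0.
    """
    return sum(p[a] * math.log2(p[a] / q[a]) for a in p if p[a] > 0)


def symmetric_kl(p, q):
    """Average of the two directions, since D(p || q) is not symmetric."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


# Made-up emission distributions of two profile columns.
col_a = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}
col_b = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

d_ab = kl_divergence(col_a, col_b)
d_ba = kl_divergence(col_b, col_a)
print(d_ab, d_ba, symmetric_kl(col_a, col_b))
```

A column score derived from such a divergence (small divergence meaning similar columns) is the kind of quantity a dynamic programming alignment of two profiles would accumulate.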
15. Available resources for Profile HMMs
- HMMER and SAM were among the first available programs for profile HMMs
- HMMER: S. Eddy at Washington University
- SAM: Sequence Alignment and Modeling System
- R. Hughey at University of California, Santa Cruz
- available free for research
- SAM has online servers to perform sequence comparisons
- http://www.cse.ucsc.edu/research/compbio/sam.html
16. Available resources for Profile HMMs
- InterPro consortium in Europe has many resources for protein data
- Database of protein families and domains
- Brings together several different databases under one umbrella
- Pfam and SUPERFAMILY are profile HMM libraries associated with InterPro
- Pfam is based on HMMER search, and SUPERFAMILY on SAM search and modeling
17. Available resources for Profile HMMs
- SAM's iterative approach for building an HMM
- find a set of close homologs using BLASTP
- learn the alignment and build a model using the close homologs
- use BLASTP to get more remote homologs using the first set of sequences (relax the E value)
- iteratively refine the HMM
- SAM uses Dirichlet priors as pseudo counts for parameters
- Unlike HMMER, hand-tuned seed alignments are not required, as the alignments are learnt by the algorithm
18. Available resources for Profile HMMs
- SUPERFAMILY database incorporates
- library of profile HMMs representing all proteins of known structure
- assignments to predicted proteins from all completely sequenced genomes
- search and alignment services
- models and domain assignments are freely available
- Based on SCOP classification of protein domains
- SAM HMM iterative procedure used for model building and sequence alignment
19. Available resources for Profile HMMs
- In SUPERFAMILY
- Each SCOP superfamily is represented as an HMM
- Model built using the SAM procedure based on 4 variants
- accurate structure-based alignments
- hand-labeled alignments
- automatic alignments using ClustalW
- sequence members used separately as seeds
- Assignment of superfamilies
- for a given sequence, every model is scored across the whole sequence using Viterbi scoring
- the model which scores highest has its superfamily assigned to the region
21. Online Demonstrations
- http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/temp/624288710157514.html
22. References
- Durbin, R., Eddy, S., Krogh, A., and Mitchison, G., Biological Sequence Analysis, Cambridge University Press, 2002
- Baldi, P. and Brunak, S., Bioinformatics: The Machine Learning Approach, The MIT Press, Cambridge, 1998
- Eddy, S., Profile Hidden Markov Models, Bioinformatics, vol. 14, no. 9, pp. 755-763, 1998
- Karplus, K., Barrett, C., and Hughey, R., Hidden Markov models for detecting remote homologies, Bioinformatics, vol. 14, no. 10, pp. 846-856, 1998
- Madera, M. and Gough, J., A comparison of profile hidden Markov model procedures for remote homology detection, Nucleic Acids Research, vol. 30, no. 19, pp. 4321-4328, 2002
- Gough, J., Karplus, K., Hughey, R., and Chothia, C., Assignment of Homology to Genome Sequences using a Library of Hidden Markov Models that represent all Proteins of known structure, J. Mol. Biol., 313, pp. 903-919, 2001
23. References
- Yona, G. and Levitt, M., Within the Twilight Zone: A sensitive Profile-Profile comparison tool based on Information Theory, J. Mol. Biol., 315, pp. 1257-1275, 2002
- Madera, M., Vogel, C., Kummerfeld, S. K., Chothia, C., and Gough, J., The SUPERFAMILY database in 2004: additions and improvements, Nucleic Acids Research, vol. 32, Database Issue, pp. D235-D239, 2004
- Bateman, A., Birney, E., Durbin, R., Eddy, S., Finn, R., and Sonnhammer, E., Pfam 3.1: 1313 multiple alignments and profile HMMs match the majority of proteins, Nucleic Acids Research, vol. 27, no. 1, 1999
- Andreeva, A., et al., SCOP database in 2004: refinements integrate structure and sequence family data, Nucleic Acids Research, vol. 32, Database Issue, pp. D226-D229, 2004
- Many other online resources and tutorials