Profile Hidden Markov Models


Profile Hidden Markov Models, Bioinformatics Fall-2004, Dr Webb Miller and Dr Claude dePamphilis, Dhiraj Joshi, Department of Computer Science and Engineering


Profile Hidden Markov Models
  • Bioinformatics Fall-2004
  • Dr Webb Miller and Dr Claude dePamphilis
  • Dhiraj Joshi
  • Department of Computer Science and Engineering
  • The Pennsylvania State University

  • Introduction to HMMs
  • Profile HMMs
  • Available resources for Profile HMMs
  • Some online demonstrations

Introduction to HMMs
  • Hidden Markov Models: Formalism
  • statistical techniques for modeling patterns in
    sequential data
  • first-order Markov property (memorylessness): the
    next state depends only on the current state
  • a state is generally a hidden entity which emits
    symbols or features
  • the same symbol can be emitted by several states
  • an HMM is characterized by its transition
    probabilities and emission distributions
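
The formalism above can be made concrete with a small generative sketch. The two states, the DNA alphabet and every probability below are illustrative assumptions, not values from the slides; the point is only that the state path is hidden, each state emits symbols from its own distribution, and the same symbol can come from either state.

```python
import random

# Toy two-state HMM; all names and numbers are illustrative assumptions.
START = {"AT_rich": 0.5, "GC_rich": 0.5}
TRANS = {
    "AT_rich": {"AT_rich": 0.9, "GC_rich": 0.1},
    "GC_rich": {"AT_rich": 0.2, "GC_rich": 0.8},
}
EMIT = {
    "AT_rich": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "GC_rich": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}


def draw(dist, rng):
    """Draw one key from a {outcome: probability} dictionary."""
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def sample(length, seed=0):
    """Return (hidden state path, emitted symbol string) of the given length."""
    rng = random.Random(seed)
    path, symbols = [], []
    state = draw(START, rng)
    for _ in range(length):
        path.append(state)
        symbols.append(draw(EMIT[state], rng))
        # First-order Markov property: the next state depends only on the current one.
        state = draw(TRANS[state], rng)
    return path, "".join(symbols)
```

Calling `sample(20)` returns a state path and a 20-symbol string; an observer who sees only the string has to infer the path.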

Introduction to HMMs
  • Hidden Markov Models: Parameter Estimation
  • parameters: transition probabilities and emission
    distributions
  • iterative computational algorithms used
  • EM algorithm, Viterbi algorithm
  • algorithms based on dynamic programming to save
    computational cost
  • usually the iterations involve variants of the
    following two steps
  • estimate the state sequence which maximizes
    likelihood under the current parameter set
  • update the parameter set based on the estimated
    state sequence
  • the algorithms may converge to a local rather than
    the global optimum
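
The two alternating steps can be sketched as Viterbi training (a "hard" variant of EM, not full Baum-Welch). This is a minimal sketch for a single observation sequence; the function names, the add-one pseudocount and the stopping rule are assumptions made for illustration.

```python
import math


def viterbi(obs, states, start, trans, emit):
    """Most probable state path for obs, computed in log space."""
    log = lambda p: math.log(p) if p > 0 else float("-inf")
    score = {s: log(start[s]) + log(emit[s][obs[0]]) for s in states}
    back = []
    for symbol in obs[1:]:
        prev, score, pointers = score, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + log(trans[r][s]))
            score[s] = prev[best] + log(trans[best][s]) + log(emit[s][symbol])
            pointers[s] = best
        back.append(pointers)
    last = max(states, key=lambda s: score[s])
    path = [last]
    for pointers in reversed(back):  # trace the back-pointers to recover the path
        path.append(pointers[path[-1]])
    return path[::-1], score[last]


def reestimate(obs, path, states, alphabet, pseudo=1.0):
    """Update step: normalized transition/emission counts along the path."""
    t = {r: {s: pseudo for s in states} for r in states}
    e = {s: {a: pseudo for a in alphabet} for s in states}
    for i, (s, a) in enumerate(zip(path, obs)):
        e[s][a] += 1
        if i + 1 < len(path):
            t[s][path[i + 1]] += 1
    norm = lambda row: {k: v / sum(row.values()) for k, v in row.items()}
    return {r: norm(t[r]) for r in states}, {s: norm(e[s]) for s in states}


def viterbi_training(obs, states, alphabet, start, trans, emit, rounds=10):
    """Alternate decoding and re-estimation until the path stops changing."""
    path = None
    for _ in range(rounds):
        new_path, _logp = viterbi(obs, states, start, trans, emit)
        if new_path == path:
            break  # converged, possibly to a local optimum
        path = new_path
        trans, emit = reestimate(obs, path, states, alphabet)
    return path, trans, emit
```

The result depends on the starting parameters, which is how the local-optimum behaviour shows up in practice.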

  • Introduction to HMMs
  • Profile HMMs
  • Available resources for Profile HMMs
  • Some online demonstrations

Profile Hidden Markov Models
  • Stochastic methods to model multiple sequence
    alignments of protein and DNA sequences
  • Potential application domains
  • protein families could be modeled as an HMM or a
    group of HMMs
  • constructing a profile HMM
  • new protein sequences could be aligned with
    stored models to detect remote homology
  • aligning a sequence with a stored profile HMM
  • align two or more protein family profile HMMs to
    detect homology
  • finding statistical similarities between two
    profile HMM models

Profile Hidden Markov Models
  • Constructing a profile HMM
  • a multiple sequence alignment is assumed
  • each consensus column can exist in 3 states
  • match, insert and delete states
  • number of states depends upon the length of the
    alignment
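
A minimal sketch of the construction step follows. It assumes a common heuristic (columns with at most 50% gaps become consensus columns) and one conventional state layout (one match and one delete state per consensus column, plus one insert state per position including before the first column). The alignment, the threshold and the function names are all illustrative assumptions.

```python
# Toy alignment; "-" marks a gap. Rows must all have the same length.
ALIGNMENT = [
    "VGA--HAGEY",
    "V----NVDEV",
    "VEA--DVAGH",
    "VKG------D",
    "VYS--TYETS",
    "FNA--NIPKH",
    "IAGADNGAGV",
]


def match_columns(alignment, max_gap_fraction=0.5):
    """Indices of columns treated as consensus (match) columns."""
    n_seqs = len(alignment)
    columns = []
    for j in range(len(alignment[0])):
        gaps = sum(1 for row in alignment if row[j] == "-")
        if gaps / n_seqs <= max_gap_fraction:
            columns.append(j)
    return columns


def count_states(n_match):
    """State counts for a model with n_match consensus columns.

    Begin and end states are not counted.
    """
    return {"match": n_match, "insert": n_match + 1, "delete": n_match}
```

For the toy alignment, the two mostly-gap columns are left to insert states and the remaining eight columns become match states, so the model size grows linearly with the number of consensus columns.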

Profile Hidden Markov Models
  • A typical profile HMM architecture
  • squares represent match states
  • diamonds represent insert states
  • circles represent delete states
  • arrows represent transitions

Profile Hidden Markov Models
  • A typical profile HMM architecture
  • transition between match states
  • transition from match state to insert state
  • transition within insert state
  • transition from match state to delete state
  • transition within delete state
  • emission of symbol at a state
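
The formulas that accompanied these bullets are not preserved in this transcript. One widely used notation, following the convention of Durbin et al. and given here as an assumption rather than as the slide's own symbols, indexes states by consensus position j:

```latex
\begin{align*}
a_{M_j M_{j+1}} &: \text{transition between match states} \\
a_{M_j I_j}     &: \text{transition from match state to insert state} \\
a_{I_j I_j}     &: \text{transition within insert state (self-loop)} \\
a_{M_j D_{j+1}} &: \text{transition from match state to delete state} \\
a_{D_j D_{j+1}} &: \text{transition within delete states} \\
e_{M_j}(x),\; e_{I_j}(x) &: \text{emission of symbol } x \text{ at a match or insert state}
\end{align*}
```

Delete states are silent in this convention, so they carry no emission term.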

Profile Hidden Markov Models
  • Estimation of parameters
  • transition probabilities estimated as frequency
    of a transition in a given alignment
  • emission probabilities estimated as frequency of
    an emission in a given alignment
  • pseudo counts usually introduced to account for
    transitions / emissions which were not present
    in the alignment

Profile Hidden Markov Models
  • Estimation of parameters
  • with pseudo counts
  • Dirichlet prior distribution used to determine
    pseudo counts
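
As a minimal sketch of estimation with pseudo counts, the function below computes emission probabilities for one match column as (c_a + A·q_a) / (N + A), where c_a is the observed count of symbol a, N the number of observations, q_a a background frequency and A the total pseudocount weight. This is the posterior mean under a single Dirichlet prior with parameters A·q_a; Dirichlet mixtures are more elaborate. The function and parameter names are assumptions for illustration.

```python
from collections import Counter


def emission_probs(column, alphabet, weight=1.0, background=None):
    """Estimate emission probabilities for one match column.

    column:     residues observed in the column, gaps already removed
    weight:     total pseudocount mass A; weight=0 gives plain frequencies
                (which requires a non-empty column)
    background: {symbol: frequency} q; uniform if not given
    """
    if background is None:
        background = {a: 1.0 / len(alphabet) for a in alphabet}
    counts = Counter(column)
    total = len(column) + weight
    return {a: (counts[a] + weight * background[a]) / total for a in alphabet}
```

With `emission_probs("AAAAC", "ACGT", weight=4.0)` the unseen symbols G and T receive a small non-zero probability, whereas `weight=0` would assign them zero and make any sequence containing them impossible under the model.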

Profile Hidden Markov Models
  • Scoring a sequence against a profile HMM
  • Viterbi algorithm used to find the best state
    path
  • Simulated annealing based methods also used
  • Maximization criteria: log likelihood or log
    odds
  • Log likelihood score generally depends on the
    length of the sequence and hence is not preferred
  • If an alignment not given initially, the
    alignment could be learnt iteratively using

Profile Hidden Markov Models
  • Comparing two profile HMMs
  • Profile-profile comparison tool based on
    information theory
  • based on Kullback-Leibler divergence criterion
    for comparing 2 statistical distributions
  • dynamic programming used to compare entire
    profiles
  • detect weak similarities between models

  • Introduction to HMMs
  • Profile HMMs
  • Available resources for Profile HMMs
  • Some online demonstrations

Available resources for Profile HMMs
  • HMMER and SAM were among the first available
    programs for profile HMMs
  • HMMER: S. Eddy at Washington University
  • SAM: Sequence Alignment and Modeling System
  • R. Hughey at University of
    California, Santa Cruz
  • available free for research
  • SAM has online servers to perform sequence
  • http//

Available resources for Profile HMMs
  • InterPro consortium in Europe has many resources
    for protein data
  • Database of protein families and domains
  • Brings together several different databases under
    one umbrella
  • Pfam and Superfamily are profile HMM libraries
    associated with InterPro
  • Pfam based on HMMER search and Superfamily based
    on SAM search and modeling

Available resources for Profile HMMs
  • SAM's iterative approach for building an HMM
  • find a set of close homologs using BLASTP
  • learn the alignment and build a model using the
    close homologs
  • use BLASTP to get more remote homologs using the
    first set of sequences (relax the E value)
  • iteratively refine the HMM
  • SAM uses Dirichlet priors as pseudo counts for
  • Hand tuned seed alignments not required as the
    alignments are learnt by the algorithm unlike

Available resources for Profile HMMs
  • SUPERFAMILY database incorporates
  • library of profile HMMs representing all proteins
    of known structure
  • assignments to predicted proteins from all
    completely sequenced genomes
  • search and alignment services
  • models and domain assignments are freely
    available
  • Based on SCOP classification of protein domains
  • SAM HMM iterative procedure used for model
    building and sequence alignment

Available resources for Profile HMMs
  • In Superfamily
  • Each SCOP superfamily is represented as an HMM
  • Models built using the SAM procedure, based on 4
    variants
  • accurate structure based alignments
  • hand labeled alignments
  • automatic alignments using ClustalW
  • sequence members used separately as seeds
  • Assignment of superfamilies
  • for a given sequence, every model is scored
    across the whole sequence using Viterbi scoring
  • model which scores highest has its superfamily
    assigned to the region

  • Introduction to HMMs
  • Profile HMMs
  • Available resources for Profile HMMs
  • Some online demonstrations

Online Demonstrations
  • http//

References
  • Durbin, R., Eddy, S., Krogh, A., and Mitchison, G.,
    Biological Sequence Analysis, Cambridge
    University Press, 2002
  • Baldi, P. and Brunak, S., Bioinformatics: The
    Machine Learning Approach, The MIT Press,
    Cambridge, 1998
  • Eddy, S., Profile hidden Markov models,
    Bioinformatics (review), vol. 14, no. 9, pp.
    755-763, 1998
  • Karplus, K., Barrett, C., and Hughey, R., Hidden
    Markov models for detecting remote protein
    homologies, Bioinformatics, vol. 14, no. 10,
    pp. 846-856, 1998
  • Madera, M. and Gough, J., A comparison of profile
    hidden Markov model procedures for remote
    homology detection, Nucleic Acids Research,
    vol. 30, no. 19, pp. 4321-4328, 2002
  • Gough, J., Karplus, K., Hughey, R., and Chothia, C.,
    Assignment of homology to genome sequences
    using a library of hidden Markov models that
    represent all proteins of known structure, J.
    Mol. Biol., vol. 313, pp. 903-919, 2001

  • Yona, G. and Levitt, M., Within the twilight zone:
    a sensitive profile-profile comparison tool based
    on information theory, J. Mol. Biol., vol. 315,
    pp. 1257-1275, 2002
  • Madera, M., Vogel, C., Kummerfeld, S. K.,
    Chothia, C., and Gough, J., The SUPERFAMILY
    database in 2004: additions and improvements,
    Nucleic Acids Research, vol. 32, Database Issue,
    pp. D235-D239, 2004
  • Bateman, A., Birney, E., Durbin, R., Eddy, S.,
    Finn, R., and Sonnhammer, E., Pfam 3.1: 1313
    multiple alignments and profile HMMs match the
    majority of proteins, Nucleic Acids Research,
    vol. 27, no. 1, 1999
  • Andreeva, A., et al., SCOP database in 2004:
    refinements integrate structure and sequence
    family data, Nucleic Acids Research, vol. 32,
    Database Issue, pp. D226-D229, 2004
  • Many other online resources and tutorials
