Title: Hidden Markov Models and Gene Finding
1. Hidden Markov Models and Gene Finding
- Temidayo Ajayi
- Electrical Engineering and Computer Science
- dajayi_at_ku.edu
2. Brief Overview
- Today's goals
- Introduce the concept of Hidden Markov Models as a general tool used in bioinformatics
- Demonstrate their application in gene finding by reviewing two literature articles
3. Learning Objectives
- At the end of my talk you should have
- A knowledge of the terms used in this area of Bioinformatics and Machine Learning
- A general understanding of HMMs
- A knowledge of what Gene Finding is
- An introduction to two kinds of Gene Finders
4. Outline
- Introduction of terminology
- Brief intro to the HMM
- What is it?
- How is it used?
- Advantages and disadvantages
- Approaches to Gene Finding
- Application of HMMs to Gene Finding
5. Outline (cont'd)
- What is Gene Finding?
- Why is it studied?
- Analysis of these approaches
- Sample execution of a gene finder program
- Conclusion
- Questions / Discussion
6. Explanation of Terminology
- Base pair (bp): an A-T or G-C pair in the DNA of an organism
- Introns: non-coding regions within a gene
- Exons: coding regions
- Open Reading Frame (ORF): a coding region of DNA; start and stop codons mark the beginning and end of an ORF
- GC content: usually expressed as a percentage, it is the proportion of G-C base pairs in the DNA molecule or genome sequence being investigated
- Donor: exon-intron boundary (EI)
- Acceptor: intron-exon boundary (IE)
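The GC-content definition above translates directly into code; a minimal sketch (the function name is mine, not from the talk):

```python
def gc_content(seq: str) -> float:
    """Fraction of G/C bases in a DNA sequence (0.0 for an empty string)."""
    seq = seq.upper()
    # Each True from `base in "GC"` counts as 1 in the sum
    return sum(base in "GC" for base in seq) / len(seq) if seq else 0.0

print(gc_content("ATGCGC"))  # 4 of 6 bases are G or C
```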
7. Explanation of Terms (cont'd)
- Splice sites: acceptors and donors
8. Explanation of Terms (cont'd)
- Intergenic region: a stretch of DNA sequence located between clusters of genes; such regions comprise a large percentage of the human genome but contain few or no genes
9. The Hidden Markov Model (HMM)
- A finite set of states, each of which is associated with a (generally multidimensional) probability distribution
- Transitions among the states are governed by a set of probabilities called transition probabilities
- In a particular state, an outcome or observation can be generated according to the associated probability distribution
- Only the outcome, not the state, is visible to an external observer; the states are therefore hidden from the outside, hence the name Hidden Markov Model
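To make the definition above concrete, here is a toy two-state HMM written out as plain Python dictionaries. The exon/intron labels fit the gene-finding setting of this talk, but all the probability values are invented for illustration, not trained parameters:

```python
# A toy HMM over DNA: two hidden states, four observable symbols.
# All numbers are illustrative, not trained parameters.
states = ["exon", "intron"]
symbols = ["A", "C", "G", "T"]

# Transition probabilities P(next state | current state)
trans = {
    "exon":   {"exon": 0.9, "intron": 0.1},
    "intron": {"exon": 0.1, "intron": 0.9},
}

# Emission probabilities P(symbol | state); exons made slightly GC-rich here
emit = {
    "exon":   {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}

# Initial state distribution
start = {"exon": 0.5, "intron": 0.5}

# Probability of starting in "exon" and emitting "G" there:
p = start["exon"] * emit["exon"]["G"]
print(p)  # 0.5 * 0.3 = 0.15
```

An external observer would only see a string over `symbols`; the exon/intron state sequence that generated it stays hidden, which is exactly the point of the model.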
10. Problems to be solved by the HMM
- Three canonical problems
- Given the model parameters, compute the probability of a particular output sequence. Solved by the forward algorithm
- Given the model parameters, find the most likely sequence of (hidden) states which could have generated a given output sequence. Solved by the Viterbi algorithm
- Given an output sequence, find the most likely set of state transition and output probabilities. Solved by the Baum-Welch algorithm
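The Viterbi algorithm from the second problem can be sketched as a short dynamic program. This is a generic textbook implementation (not code from GENSCAN or ExonHunter), run on an invented two-state exon/intron model:

```python
import math

def viterbi(obs, states, start, trans, emit):
    """Most likely hidden-state path for an observation sequence.
    Works in log space to avoid underflow on long sequences."""
    # V[t][s] = best log-probability of any path ending in state s at time t
    V = [{s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            best_prev = max(states, key=lambda p: V[t - 1][p] + math.log(trans[p][s]))
            V[t][s] = V[t - 1][best_prev] + math.log(trans[best_prev][s]) \
                      + math.log(emit[s][obs[t]])
            back[t][s] = best_prev
    # Trace back from the best final state
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

# Toy parameters (invented for illustration): exons GC-rich, introns AT-rich
states = ["exon", "intron"]
start = {"exon": 0.5, "intron": 0.5}
trans = {"exon":   {"exon": 0.9, "intron": 0.1},
         "intron": {"exon": 0.1, "intron": 0.9}}
emit = {"exon":   {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "intron": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}
print(viterbi("GGGAAA", states, start, trans, emit))
# ['exon', 'exon', 'exon', 'intron', 'intron', 'intron']
```

Note how the self-transition probabilities (0.9) discourage switching states, so the decoded path changes label only once, where the sequence composition flips.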
11. HMM Sample Structure
- Model is a linear sequence of nodes
- Squares: matches
- Diamonds: insertions
- Circles: deletions
12. Why HMMs might be a good fit for Gene Finding
- Classification: classifying observations within a sequence
- Order: a DNA sequence is a set of ordered observations
- Grammar / Architecture: the eukaryotic cell structure contains the needed information
- Success measure: number of complete exons correctly labeled
- Training data: available from various genome annotation projects
13. HMM Advantages
- Statistical grounding
- HMMs have a strong mathematical structure and hence can form the theoretical basis for use in a wide range of applications
- Modularity
- HMMs can be combined into larger HMMs
- Transparency of the model
- Assuming an architecture with a good design
- People can read the model and make sense of it
- The model itself can help increase understanding of the original data
14. HMM Advantages (cont'd)
- Incorporation of prior knowledge
- Incorporate prior knowledge into the architecture
- Initialize the model close to something believed to be correct
- Use prior knowledge to constrain the training process
15. How does Gene Finding make use of HMM advantages?
- Statistics
- Many systems alter the training process to better suit their success measure
- Modularity
- Almost all systems use a combination of models, each individually trained for a specific gene region
- Prior knowledge
- A fair amount of prior biological knowledge is built into each architecture
16. HMM Disadvantages
- Markov chains
- States are supposed to be independent
- P(y) must be independent of P(x), and vice versa
- This usually is not true
- Can get around it when relationships are local
17. HMM Disadvantages (cont'd)
- Some classic Machine Learning problems
- Watch out for local maxima
- Model may not converge to a truly optimal parameter set for a given training set
- Speed
- Due to exhaustive enumeration and expansion of all possible paths through the model
18. HMM Overview
- Advantages
- Mathematical Grounding
- Modularity
- Transparency
- Prior Knowledge
- Disadvantages
- State independence
- Local maxima
- Speed
19. Approaches to Gene Finding
- Might need to look at genes we have seen before
- Search known databases
- Homology-based gene identification
- Might need to find genes we know nothing about (ab initio)
- Use purely computational methods
- HMMs
- Directed acyclic graphs
- Weight matrix methods
20. Gene Finder: GENSCAN
- "Prediction of Complete Gene Structures in Human Genomic DNA", Burge and Karlin
- Introduces a general probabilistic model for the gene structure of human genomic sequences and describes its application to gene finding in GENSCAN
- GENSCAN uses a three-periodic fifth-order Markov model of coding regions rather than specialized models of particular protein motifs or database homology information
- Other gene finders, e.g. Genie, also use this model; however, GENSCAN differs from them. How?
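A three-periodic fifth-order Markov model conditions each base on the previous five bases and on the codon phase (position mod 3). The sketch below shows the general idea with simple add-one smoothing; GENSCAN's actual parameter estimation differs, and the function names are mine:

```python
import math
from collections import defaultdict

ORDER = 5  # each base is conditioned on the preceding 5 bases

def train_3periodic(coding_seqs):
    """Estimate P(base | previous 5 bases, codon phase) from coding
    sequences assumed to start in frame (phase = position mod 3)."""
    counts = [defaultdict(lambda: defaultdict(int)) for _ in range(3)]
    for seq in coding_seqs:
        for i in range(ORDER, len(seq)):
            counts[i % 3][seq[i - ORDER:i]][seq[i]] += 1
    # Normalize with add-one smoothing over the 4 bases
    probs = [dict() for _ in range(3)]
    for phase in range(3):
        for ctx, c in counts[phase].items():
            total = sum(c.values()) + 4
            probs[phase][ctx] = {b: (c[b] + 1) / total for b in "ACGT"}
    return probs

def log_score(seq, probs):
    """Log-probability of seq under the three-periodic model
    (contexts unseen in training fall back to uniform 1/4)."""
    score = 0.0
    for i in range(ORDER, len(seq)):
        table = probs[i % 3].get(seq[i - ORDER:i])
        score += math.log(table[seq[i]]) if table else math.log(0.25)
    return score
```

In a gene finder, coding-model scores like this are compared against a non-coding background model to decide which regions look like exons.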
21. GENSCAN: Distinguishing Factors
- Use of an explicitly double-stranded genomic sequence model in which potential genes occurring on both DNA strands are analyzed in a simultaneous and integrated fashion
- Flexibility of the model to contain a partial gene, a complete gene, multiple complete or partial genes, or no gene at all!
- A novel (as of 1997) method, Maximum Dependence Decomposition, to model functional signals in DNA (or protein) sequences, which allows for dependencies between signal positions in a fairly natural and statistically justifiable way
22. GENSCAN: Comparing Other Gene Finders
- Sn: Sensitivity
- Sp: Specificity
- Ac: Approximate Correlation
- ME: Missing Exons
- WE: Wrong Exons
- GENSCAN performance data: http://genes.mit.edu/Accuracy.html
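At the nucleotide level, the first two metrics are simple ratios (note that in gene finding, "specificity" conventionally means what machine learning calls precision). A minimal sketch with invented positions; the exon-level variants are computed analogously over whole exons:

```python
def sensitivity_specificity(predicted, actual):
    """Nucleotide-level accuracy: `predicted` and `actual` are sets of
    positions labeled as coding. Sn = TP/(TP+FN); Sp = TP/(TP+FP)."""
    tp = len(predicted & actual)   # coding positions correctly called
    fn = len(actual - predicted)   # coding positions missed
    fp = len(predicted - actual)   # non-coding positions called coding
    sn = tp / (tp + fn) if actual else 0.0
    sp = tp / (tp + fp) if predicted else 0.0
    return sn, sp

# Toy example: positions 10-19 truly coding, predictor called 15-24
print(sensitivity_specificity(set(range(15, 25)), set(range(10, 20))))
# TP = 5 (positions 15-19), so Sn = 5/10 = 0.5 and Sp = 5/10 = 0.5
```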
23. GENSCAN: Discussion
- Novel features of the model include
- Use of distinct, explicit, empirically derived sets of model parameters to account for differences in gene structure and composition between distinct isochores of the human genome
24. GENSCAN: Discussion (cont'd)
- Capacity to predict multiple genes in a sequence, to deal with partial as well as complete genes, and to predict consistent genes occurring on either or both DNA strands
- New statistical models of donor and acceptor splice sites which capture potentially important dependencies between signal positions
25. Gene Finder: ExonHunter
- "ExonHunter: A comprehensive approach to gene finding", Brejova et al.
- The method gathers numerous sources of information
- Genomic sequences
- Expressed Sequence Tags
- Protein databases
- All information is combined into a gene finder based on a hidden Markov model in a novel and systematic way
- Earlier successes of GENSCAN segued into comparative approaches to gene finding
- Experiments show that no one information source alone is sufficient to achieve the same performance as their combination
26. Gene Finder: ExonHunter
- An HMM for gene finding defines a conditional probability distribution over all possible annotations (sequences of labels) of a specific sequence
- The model utilizes advisors to represent supplementary information
- For each position in the sequence, an advisor specifies a probability distribution over annotation labels
27. Gene Finder: ExonHunter
- Next, all advisors are combined into a SUPERADVISOR
- The superadvisor prediction at a particular position is a probability distribution over all labels x = (x1, ..., xn), where xi is the probability of the i-th annotation label, given all advice
- The superadvisor is finally combined with an HMM
28. ExonHunter: Distinguishing Features
- GC content model: transition and emission probabilities depend on the GC content level, estimated from a 1000 bp window around the current position
- Signal models: use of higher-order trees (HOT) of order 2 to model acceptor and donor site signals
29. ExonHunter: Distinguishing Features (cont'd)
- Length distributions: divided into a head with an arbitrary distribution and a geometrically decaying tail
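Such a head-plus-geometric-tail distribution is easy to write down; this sketch (names and numbers are illustrative, not ExonHunter's actual parameters) keeps the tail properly normalized:

```python
def length_prob(length, head, tail_p):
    """P(region length = length) for a distribution with an explicit head
    (`head[k-1]` is the probability of length k, for k up to len(head))
    and a geometrically decaying tail beyond it. `tail_p` is the
    per-step continuation probability of the tail."""
    if length <= 0:
        return 0.0
    if length <= len(head):
        return head[length - 1]
    # Probability mass not used by the head, spread geometrically past it
    remaining = 1.0 - sum(head)
    return remaining * (1 - tail_p) * tail_p ** (length - len(head) - 1)

head = [0.1, 0.2, 0.3]  # explicit probabilities for lengths 1..3 (illustrative)
print(length_prob(4, head, 0.5))  # first tail length gets 0.4 * 0.5 = 0.2
```

The geometric tail is what a plain HMM state's self-loop produces on its own; the explicit head lets the model match empirical exon/intron length histograms where a pure geometric fit is poor.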
30. ExonHunter: Experimental Results
31. ExonHunter: Conclusion
- Model based on probabilistic statements made using various sources of information, called advisors
- A quadratic-programming-based method that extends the traditional linear combination approach, with the Viterbi algorithm adapted to the domain
- ExonHunter outperforms several other programs, such as SLAM and TWINSCAN
32. ExonHunter: A Trip to the Home Site
- A brief demonstration of a run of ExonHunter, at
- http://software.bioinformatics.uwaterloo.ca/exonhunter/
33. Outlook: The Future of Gene Finding
- A shift from pattern recognition to database searching and information integration
- Computational methods will still be necessary for other organisms
- As tools become better, faster, and more complete, the questions to be asked become more interesting: recall that GENSCAN started with small genomic contigs, while ExonHunter was able to combine much more data. Questions will therefore tend to be more genome-based than sequence-based
34. HGP Timeline: The National Human Genome Research Institute, http://www.genome.gov/11007154
35. Conclusion
- The Hidden Markov Model (HMM) is a finite set of states with transitions governed by probability distributions
- Strong mathematical grounding
- Modular
- Transparent
- Might encounter local maxima
- Slow
36. Conclusion (cont'd)
- GENSCAN
- Ab initio approach to gene finding
- Features a novel algorithm to allow flexibility with input sequences
- Performs well with small genomic contigs
- ExonHunter
- Ab initio approach to gene finding
- Combines multiple sources of information into an HMM
- Uses advisors to make superior decisions
37. Questions