Title: Hidden Markov Models and Gene Finding
1. Hidden Markov Models and Gene Finding
- Temidayo Ajayi
- Electrical Engineering and Computer Science
- dajayi_at_ku.edu
2. Brief Overview
- Today's goals
- Introduce the concept of Hidden Markov Models as a general tool used in bioinformatics
- Demonstrate their application in gene finding by reviewing two literature articles
3. Learning Objectives
- At the end of my talk you should have
- A knowledge of the terms used in this area of Bioinformatics and Machine Learning
- A general understanding of HMMs
- A knowledge of what Gene Finding is
- An introduction to two kinds of Gene Finders
4. Outline
- Introduction of terminology
- Brief intro to the HMM
- What is it?
- How is it used?
- Advantages and disadvantages
- Approaches to Gene Finding
- Application of HMMs to Gene Finding
5. Outline (cont'd)
- What is Gene Finding?
- Why is it studied?
- Analysis of these approaches
- Sample execution of a gene finder program
- Conclusion
- Questions / Discussion
6. Explanation of Terminology
- Base pair (bp): an A-T or G-C pair in the DNA of an organism
- Introns: non-coding regions within a gene
- Exons: coding regions
- Open Reading Frame (ORF): a coding region of DNA; start and stop codons mark the beginning and end of an ORF
- GC content: usually expressed as a percentage, it is the proportion of G-C base pairs in the DNA molecule or genome sequence being investigated
- Donor: exon-intron boundary (EI)
- Acceptor: intron-exon boundary (IE)
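The GC-content definition above translates directly into code; a minimal sketch (the function name is mine, not from the talk):

```python
def gc_content(seq: str) -> float:
    """Fraction of G/C bases in a DNA sequence (0.0 for an empty string)."""
    seq = seq.upper()
    # Each True from `base in "GC"` counts as 1 in the sum
    return sum(base in "GC" for base in seq) / len(seq) if seq else 0.0

print(gc_content("ATGCGC"))  # 4 of 6 bases are G or C
```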
7. Explanation of Terms (cont'd)
- Splice sites: acceptors and donors
8. Explanation of Terms (cont'd)
- Intergenic region: a stretch of DNA sequence located between clusters of genes; such regions comprise a large percentage of the human genome but contain few or no genes
9. The Hidden Markov Model (HMM)
- A finite set of states, each of which is associated with a (generally multidimensional) probability distribution
- Transitions among the states are governed by a set of probabilities called transition probabilities
- In a particular state, an outcome or observation can be generated according to the associated probability distribution
- Only the outcome, not the state, is visible to an external observer; the states are therefore hidden from the outside, hence the name Hidden Markov Model
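To make the definition above concrete, here is a toy two-state HMM written out as plain Python dictionaries. The exon/intron labels fit the gene-finding setting of this talk, but all the probability values are invented for illustration, not trained parameters:

```python
# A toy HMM over DNA: two hidden states, four observable symbols.
# All numbers are illustrative, not trained parameters.
states = ["exon", "intron"]
symbols = ["A", "C", "G", "T"]

# Transition probabilities P(next state | current state)
trans = {
    "exon":   {"exon": 0.9, "intron": 0.1},
    "intron": {"exon": 0.1, "intron": 0.9},
}

# Emission probabilities P(symbol | state); exons made slightly GC-rich here
emit = {
    "exon":   {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}

# Initial state distribution
start = {"exon": 0.5, "intron": 0.5}

# Probability of starting in "exon" and emitting "G" there:
p = start["exon"] * emit["exon"]["G"]
print(p)  # 0.5 * 0.3 = 0.15
```

An external observer would only see a string over `symbols`; the exon/intron state sequence that generated it stays hidden, which is exactly the point of the model.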
10. Problems to be solved by the HMM
- Three canonical problems
- Given the model parameters, compute the probability of a particular output sequence. Solved by the forward algorithm
- Given the model parameters, find the most likely sequence of (hidden) states which could have generated a given output sequence. Solved by the Viterbi algorithm
- Given an output sequence, find the most likely set of state transition and output probabilities. Solved by the Baum-Welch algorithm
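The Viterbi algorithm from the second problem can be sketched as a short dynamic program. This is a generic textbook implementation (not code from GENSCAN or ExonHunter), run on an invented two-state exon/intron model:

```python
import math

def viterbi(obs, states, start, trans, emit):
    """Most likely hidden-state path for an observation sequence.
    Works in log space to avoid underflow on long sequences."""
    # V[t][s] = best log-probability of any path ending in state s at time t
    V = [{s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            best_prev = max(states, key=lambda p: V[t - 1][p] + math.log(trans[p][s]))
            V[t][s] = V[t - 1][best_prev] + math.log(trans[best_prev][s]) \
                      + math.log(emit[s][obs[t]])
            back[t][s] = best_prev
    # Trace back from the best final state
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

# Toy parameters (invented for illustration): exons GC-rich, introns AT-rich
states = ["exon", "intron"]
start = {"exon": 0.5, "intron": 0.5}
trans = {"exon":   {"exon": 0.9, "intron": 0.1},
         "intron": {"exon": 0.1, "intron": 0.9}}
emit = {"exon":   {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "intron": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}
print(viterbi("GGGAAA", states, start, trans, emit))
# ['exon', 'exon', 'exon', 'intron', 'intron', 'intron']
```

Note how the self-transition probabilities (0.9) discourage switching states, so the decoded path changes label only once, where the sequence composition flips.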
11. HMM Sample Structure
- Model is a linear sequence of nodes
- Squares: matches
- Diamonds: insertions
- Circles: deletions
12. Why HMMs might be a good fit for Gene Finding
- Classification: classifying observations within a sequence
- Order: a DNA sequence is a set of ordered observations
- Grammar / Architecture: the eukaryotic cell structure contains the needed information
- Success measure: number of complete exons correctly labeled
- Training data: available from various genome annotation projects
13. HMM Advantages
- Statistical grounding
- HMMs have a strong mathematical structure and hence can form the theoretical basis for use in a wide range of applications
- Modularity
- HMMs can be combined into larger HMMs
- Transparency of the model
- Assuming an architecture with a good design
- People can read the model and make sense of it
- The model itself can help increase understanding of the original data
14. HMM Advantages (cont'd)
- Incorporation of prior knowledge
- Incorporate prior knowledge into the architecture
- Initialize the model close to something believed to be correct
- Use prior knowledge to constrain the training process
15. How does Gene Finding make use of HMM advantages?
- Statistics
- Many systems alter the training process to better suit their success measure
- Modularity
- Almost all systems use a combination of models, each individually trained for a specific gene region
- Prior knowledge
- A fair amount of prior biological knowledge is built into each architecture
16. HMM Disadvantages
- Markov chains
- States are supposed to be independent
- P(y) must be independent of P(x), and vice versa
- This usually is not true
- Can get around it when relationships are local
17. HMM Disadvantages (cont'd)
- Some classic Machine Learning problems
- Watch out for local maxima
- Model may not converge to a truly optimal parameter set for a given training set
- Speed
- Due to exhaustive enumeration and expansion of all possible paths through the model
18. HMM Overview
- Advantages
- Mathematical Grounding
- Modularity
- Transparency
- Prior Knowledge
- Disadvantages
- State independence
- Local maxima
- Speed
19. Approaches to Gene Finding
- Might need to look at genes we have seen before
- Search known databases
- Homology-based gene identification
- Might need to find genes we know nothing about (ab initio)
- Use purely computational methods
- HMMs
- Directed acyclic graphs
- Weight matrix methods
20. Gene Finder: GENSCAN
- "Prediction of Complete Gene Structures in Human Genomic DNA", Burge and Karlin
- Introduces a general probabilistic model for the gene structure of human genomic sequences and describes its application to gene finding in GENSCAN
- GENSCAN uses a three-periodic fifth-order Markov model of coding regions rather than specialized models of particular protein motifs or database homology information
- Other gene finders, e.g. Genie, also use this model; however, GENSCAN differs from them. How?
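A three-periodic fifth-order Markov model conditions each base on the previous five bases and on the codon phase (position mod 3). The sketch below shows the general idea with simple add-one smoothing; GENSCAN's actual parameter estimation differs, and the function names are mine:

```python
import math
from collections import defaultdict

ORDER = 5  # each base is conditioned on the preceding 5 bases

def train_3periodic(coding_seqs):
    """Estimate P(base | previous 5 bases, codon phase) from coding
    sequences assumed to start in frame (phase = position mod 3)."""
    counts = [defaultdict(lambda: defaultdict(int)) for _ in range(3)]
    for seq in coding_seqs:
        for i in range(ORDER, len(seq)):
            counts[i % 3][seq[i - ORDER:i]][seq[i]] += 1
    # Normalize with add-one smoothing over the 4 bases
    probs = [dict() for _ in range(3)]
    for phase in range(3):
        for ctx, c in counts[phase].items():
            total = sum(c.values()) + 4
            probs[phase][ctx] = {b: (c[b] + 1) / total for b in "ACGT"}
    return probs

def log_score(seq, probs):
    """Log-probability of seq under the three-periodic model
    (contexts unseen in training fall back to uniform 1/4)."""
    score = 0.0
    for i in range(ORDER, len(seq)):
        table = probs[i % 3].get(seq[i - ORDER:i])
        score += math.log(table[seq[i]]) if table else math.log(0.25)
    return score
```

In a gene finder, coding-model scores like this are compared against a non-coding background model to decide which regions look like exons.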
21. GENSCAN: Distinguishing Factors
- Use of an explicitly double-stranded genomic sequence model in which potential genes occurring on both DNA strands are analyzed in a simultaneous and integrated fashion
- Flexibility of the model to contain a partial gene, a complete gene, multiple complete or partial genes, or no gene at all!
- A novel (as of 1997) method, Maximum Dependence Decomposition, to model functional signals in DNA (or protein) sequences, which allows for dependencies between signal positions in a fairly natural and statistically justifiable way
22. GENSCAN: Comparing Other Gene Finders
- Sn: Sensitivity
- Sp: Specificity
- Ac: Approximate Correlation
- ME: Missing Exons
- WE: Wrong Exons
- GENSCAN performance data: http://genes.mit.edu/Accuracy.html
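At the nucleotide level, the first two metrics are simple ratios (note that in gene finding, "specificity" conventionally means what machine learning calls precision). A minimal sketch with invented positions; the exon-level variants are computed analogously over whole exons:

```python
def sensitivity_specificity(predicted, actual):
    """Nucleotide-level accuracy: `predicted` and `actual` are sets of
    positions labeled as coding. Sn = TP/(TP+FN); Sp = TP/(TP+FP)."""
    tp = len(predicted & actual)   # coding positions correctly called
    fn = len(actual - predicted)   # coding positions missed
    fp = len(predicted - actual)   # non-coding positions called coding
    sn = tp / (tp + fn) if actual else 0.0
    sp = tp / (tp + fp) if predicted else 0.0
    return sn, sp

# Toy example: positions 10-19 truly coding, predictor called 15-24
print(sensitivity_specificity(set(range(15, 25)), set(range(10, 20))))
# TP = 5 (positions 15-19), so Sn = 5/10 = 0.5 and Sp = 5/10 = 0.5
```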
23. GENSCAN: Discussion
- Novel features of the model include
- Use of distinct, explicit, empirically derived sets of model parameters to account for differences in gene structure and composition between distinct isochores of the human genome
24. GENSCAN: Discussion (cont'd)
- Capacity to predict multiple genes in a sequence, to deal with partial as well as complete genes, and to predict consistent genes occurring on either or both DNA strands
- New statistical models of donor and acceptor splice sites which capture potentially important dependencies between signal positions
25. Gene Finder: ExonHunter
- "ExonHunter: A comprehensive approach to gene finding", Brejova et al.
- The method gathers numerous sources of information
- Genomic sequences
- Expressed Sequence Tags
- Protein databases
- All information is combined into a gene finder based on a hidden Markov model in a novel and systematic way
- Earlier successes of GENSCAN segued into comparative approaches to gene finding
- Experiments show that no one information source alone is sufficient to achieve the same performance as their combination
26. Gene Finder: ExonHunter
- An HMM for gene finding defines a conditional probability distribution over all possible annotations (sequences of labels) of a specific sequence
- The model utilizes advisors to represent supplementary information
- For each position in the sequence, an advisor specifies a probability distribution over annotation labels
27. Gene Finder: ExonHunter
- Next, all advisors are combined into a SUPERADVISOR
- The superadvisor prediction at a particular position is a probability distribution over all labels x = (x1, ..., xn), where xi is the probability of the i-th annotation label, given all advice
- The superadvisor is finally combined with an HMM
28. ExonHunter: Distinguishing Features
- GC content model: transition and emission probabilities depend on the GC content level, estimated from a 1000 bp window around the current position
- Signal models: use of higher-order trees (HOT) of order 2 to model acceptor and donor site signals
29. ExonHunter: Distinguishing Features (cont'd)
- Length distributions: divided into a head with an arbitrary distribution and a geometrically decaying tail
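Such a head-plus-geometric-tail distribution is easy to write down; this sketch (names and numbers are illustrative, not ExonHunter's actual parameters) keeps the tail properly normalized:

```python
def length_prob(length, head, tail_p):
    """P(region length = length) for a distribution with an explicit head
    (`head[k-1]` is the probability of length k, for k up to len(head))
    and a geometrically decaying tail beyond it. `tail_p` is the
    per-step continuation probability of the tail."""
    if length <= 0:
        return 0.0
    if length <= len(head):
        return head[length - 1]
    # Probability mass not used by the head, spread geometrically past it
    remaining = 1.0 - sum(head)
    return remaining * (1 - tail_p) * tail_p ** (length - len(head) - 1)

head = [0.1, 0.2, 0.3]  # explicit probabilities for lengths 1..3 (illustrative)
print(length_prob(4, head, 0.5))  # first tail length gets 0.4 * 0.5 = 0.2
```

The geometric tail is what a plain HMM state's self-loop produces on its own; the explicit head lets the model match empirical exon/intron length histograms where a pure geometric fit is poor.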
30. ExonHunter: Experimental Results
31. ExonHunter: Conclusion
- Model based on probabilistic statements made using various sources of information, called advisors
- A quadratic-programming-based method that extends the traditional linear combination approach, with the Viterbi algorithm adapted to the domain
- ExonHunter outperforms several other programs, such as SLAM and TWINSCAN
32. ExonHunter: A Trip to the Home Site
- A brief demonstration of a run of ExonHunter, at
- http://software.bioinformatics.uwaterloo.ca/exonhunter/
33. Outlook: The Future of Gene Finding
- A shift from pattern recognition to database searching and information integration
- Computational methods will still be necessary for other organisms
- As tools become better, faster, and more complete, the questions to be asked become more interesting: recall that GENSCAN started with small genomic contigs, while ExonHunter was able to combine much more data. Questions will therefore tend to be more genome-based than sequence-based
34. HGP Timeline: The National Human Genome Research Institute, http://www.genome.gov/11007154
35. Conclusion
- The Hidden Markov Model (HMM) is a finite set of states with transitions governed by probability distributions
- Strong mathematical grounding
- Modular
- Transparent
- Might encounter local maxima
- Slow
36. Conclusion (cont'd)
- GENSCAN
- Ab initio approach to gene finding
- Features a novel algorithm to allow flexibility with input sequences
- Performs well with small genomic contigs
- ExonHunter
- Ab initio approach to gene finding
- Combines multiple sources of information into an HMM
- Uses advisors to make superior decisions
37. Questions