Title: Gene prediction
1 Gene prediction

2 Finding eukaryotic genes
- Up to 99% of DNA is non-coding, so long ORFs can exist by chance.
- Introns break up ORFs into smaller regions, making them harder to spot.
- Introns contain information for their own removal (splicing), but these sequences may not be well enough conserved to make their prediction accurate.
3 Gene finding using homology
- Extended homology (similarity of sequence) with a known gene → likely to be a gene.
- Homology searches can also look for short regions of sequence that are conserved between many different genes.
- Several of these protein motifs encoded by a small region of DNA could indicate the presence of a gene.
4 Multiple alignment for gene prediction
- Homology detection involves sequence alignment by maximizing a similarity score function.
- This can be achieved using the dynamic programming algorithm.
5 Dynamic programming
- Alignment: work left to right in both sequences.
- At each point we can have (1) a match of two letters, (2) a gap in the first sequence matched with a letter in the second, or (3) a letter in the first with a gap in the second.
- E.g. align AIMS with AMOS:
  AIM-S
  A-MOS
6 Dynamic programming
- Path matrix: put one sequence on the vertical axis and one on the horizontal.
- Finding an alignment = finding the best path from top-left to bottom-right.
[Path matrix figure: AIMS along the horizontal axis]
- Gap in the vertical sequence = horizontal line.
- Gap in the horizontal sequence = vertical line.
- Match of two letters = diagonal line.
7 Dynamic programming
- The dynamic programming algorithm searches the tree of paths for the single best one.
- The black circles are actually the same point in the path matrix, so there is no need to search through all paths: the difference in score between these paths is just the difference in the scores of the paths up to that point, so pick the best and prune off the other paths from the black circle.
8 Global alignment
- The score function to be optimised is the sum of weights at each position of the alignment.
- The weights are defined by a substitution matrix among the four nucleotides and by a gap penalty; a simple example is a fixed value for matches and another for mismatches.
- So we can define a score matrix with a value at each point in the path matrix:
  S(i, j) = max{ S(i-1, j-1) + w_{s(i),t(j)},  S(i-1, j) - d,  S(i, j-1) - d }
- where s(i) is the letter in the ith position of the horizontal sequence, t(j) the jth in the vertical sequence, w_{s,t} is the weight for substituting letter s for t, and d is the gap penalty.
9 Global alignment
- The initial values of the score matrix are S(0, 0) = 0, S(i, 0) = -i·d and S(0, j) = -j·d.
- Dynamic programming is more efficient than many standard tree search algorithms, because most branches are pruned according to the score function.
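The recurrence and initial values above can be sketched directly in code. This is a minimal illustrative sketch (function name and parameter values are mine, not from the lecture), using a fixed match/mismatch weight in place of a full substitution matrix:

```python
# Global (Needleman-Wunsch style) alignment score, assuming a fixed
# weight for matches/mismatches and a linear gap penalty d.
def global_align(s, t, match=1, mismatch=-1, d=2):
    n, m = len(s), len(t)
    # Score matrix with initial values S(i,0) = -i*d and S(0,j) = -j*d.
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = -i * d
    for j in range(1, m + 1):
        S[0][j] = -j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            w = match if s[i - 1] == t[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + w,  # diagonal: match/mismatch
                          S[i - 1][j] - d,      # gap in the vertical sequence
                          S[i][j - 1] - d)      # gap in the horizontal sequence
    return S[n][m]

print(global_align("AIMS", "AMOS"))
```

Only the best score at each cell is kept, which is exactly the pruning of suboptimal paths described on slide 7.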
10 Local alignment
- Global alignment works when you know where to start and end, which is not usually the case for gene finding.
- Look instead for short sections of similarity (e.g. motifs).
- For this we must define the score matrix slightly differently, to allow the alignment to start anywhere.
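One standard way to make this change (the Smith-Waterman variant of the recurrence; the slide does not give the formula, so this is an assumption) is to floor every score at zero, so an alignment may begin at any cell, and to report the best score anywhere in the matrix rather than at the bottom-right corner:

```python
# Local alignment score sketch: same recurrence as the global case, but
# scores are floored at 0 (alignments can start anywhere) and the result
# is the maximum over the whole matrix, not the final corner.
def local_align_score(s, t, match=1, mismatch=-1, d=2):
    n, m = len(s), len(t)
    S = [[0] * (m + 1) for _ in range(n + 1)]  # first row/column stay 0
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            w = match if s[i - 1] == t[j - 1] else mismatch
            S[i][j] = max(0,
                          S[i - 1][j - 1] + w,
                          S[i - 1][j] - d,
                          S[i][j - 1] - d)
            best = max(best, S[i][j])
    return best
```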
11 Database searches
- When trying to find homology information on a sequence of interest, you try to align it with thousands of sequences of known genes from a database.
- The dynamic programming algorithm, which is O(nm), takes a huge number of computations to search the database for homology.
- Recall that n and m are the lengths of the target and source search strings.
- It can be made efficient by parallel processing, but faster, less rigorous algorithms are often employed instead.
12 FASTA and BLAST
- FASTA and BLAST are approximate, but more efficient.
- FASTA: do a dot matrix plot of the two sequences, with black dots where the nucleotides in the vertical sequence are the same as those in the horizontal.
- By visual inspection, locate regions of similarity, which show up as diagonal lines (or, if you allow for some small gaps, clusters of diagonal stretches).
13 FASTA
- Apply dynamic programming to these located regions only.
- The located regions are small compared to the entire matrix, so the algorithm is speeded up a lot.
- The initial search for diagonals is done by hashing, which is actually usually done on sextuples of nucleotides instead of single ones.
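The hashing step can be sketched as follows: index every k-tuple (k=6 for the sextuples above) of one sequence in a hash table, look up the other sequence's k-tuples, and bin the hits by diagonal. This is an illustrative sketch, not FASTA's actual implementation; the function name is mine:

```python
from collections import defaultdict

# FASTA-style hashing: hits with equal offset i - j lie on the same
# diagonal of the dot matrix, so dense diagonals are the candidate
# regions to hand to dynamic programming.
def diagonal_hits(query, target, k=6):
    index = defaultdict(list)            # k-tuple -> positions in target
    for j in range(len(target) - k + 1):
        index[target[j:j + k]].append(j)
    diagonals = defaultdict(int)         # diagonal (i - j) -> hit count
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            diagonals[i - j] += 1
    return dict(diagonals)
```

Building the index is linear in the target length, so this avoids the O(nm) cost of comparing every pair of positions.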
14 BLAST
- The BLAST algorithm uses an alternative approach to sequence alignment.
- The query sequence is split up into words, usually 11 bases long in a DNA sequence or 3 amino acids long in a protein sequence.
- These words are then compared to a list of all possible words, and a new list is generated containing all the words that match at what is considered a significant level. (NB some words are so common that even a perfect match is not considered significant.)
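The word-splitting step can be sketched like this. The crude low-complexity filter standing in for the "significant match" test is a deliberate simplification (real BLAST scores each word against a substitution matrix), and the function name is mine:

```python
# Split the query into overlapping words (w=11 for DNA, w=3 for protein),
# keeping each word's start position. Words from low-complexity sequence
# (here, crudely, a single repeated base) are dropped, standing in for
# BLAST's "too common to be significant" rule.
def query_words(query, w=11):
    words = []
    for i in range(len(query) - w + 1):
        word = query[i:i + w]
        if len(set(word)) > 1:   # crude low-complexity filter (assumption)
            words.append((i, word))
    return words
```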
15 BLAST
- A finite state automaton can then be used to find any exact matches of these words within a sequence database efficiently.
- The regions of both target and query sequence where words are found to have matched are then examined further. This is achieved by extending the alignment locally from either side of the word until a maximum score is achieved.
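The extension step can be sketched for one direction (rightwards); the other direction is symmetric. Real BLAST stops extending once the score drops a fixed amount below the best seen, whereas this illustrative version simply scans to the end while remembering the maximum:

```python
# Ungapped extension from an exact word hit at query position qi and
# target position ti, where the seed word has length w. Returns the best
# score achieved and the query position where it was achieved.
def extend_right(query, target, qi, ti, w, match=1, mismatch=-2):
    score = best = w * match          # the seed word scores w matches
    i, j = qi + w, ti + w
    best_end = i
    while i < len(query) and j < len(target):
        score += match if query[i] == target[j] else mismatch
        i += 1
        j += 1
        if score > best:
            best, best_end = score, i
    return best, best_end
```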
16 Gene finding with Hidden Markov Models

17 Modelling sequences as Hidden Markov processes
- Think of the nucleotide sequence in a piece of DNA as a sequence of outputs of a process that progresses through a series of discrete states, some of which are hidden to the observer.
- The Markov assumption is that the state of the system at position i+1 in the sequence depends only on the state at position i and not on other positions (usually we will number the sequence in the direction of transcription).
- The Markov assumption is of course not true of DNA, and there are various ways of minimizing the impact of this wrong assumption.
18 An example gene finder: VEIL
- The Viterbi Exon-Intron Locator (VEIL) was developed by John Henderson, Steven Salzberg, and Ken Fasman at Johns Hopkins University.
- It is a gene finder with a modular structure.
- It uses an HMM made up of sub-HMMs, each describing a different bit of the sequence: upstream noncoding DNA, exon, intron, etc.
- It assumes the test data starts and ends with noncoding DNA and contains exactly one gene.
- It uses biological knowledge to hardwire parts of the HMM, e.g. start and stop codons and splice sites.
19 The exon sub-model

20 Other submodels
- The start codon model is very simple.
- The splice junctions are also quite simple and can be hardwired (here is the 5' splice site).
[Figure: start codon sub-model, linking the Upstream state through a, t, g to the Exon state]
21 The overall model
[Figure: the overall HMM, linking the Upstream, Start codon, Exon, 5' splice site, Intron, 3' splice site, Stop codon, polyA site and Downstream sub-models]
For more details, see J. Henderson, S.L. Salzberg, and K. Fasman (1997) Journal of Computational Biology 4(2), 127-141.
22 Training
- The (sub-)HMMs can be trained using the Expectation Maximization algorithm.
- The idea is as follows:
- You estimate the transition probabilities (parameters) of the HMMs by picking those which maximise the probability of obtaining the data:
  θ* = argmax_θ P(D | θ)
- where θ are the transition probabilities between states and D is the training set of observed sequences.
- This is guaranteed to give a locally optimal solution.
23 Analysing new sequences using the trained HMM
- Once the transition probabilities have been estimated, the HMM can be fed novel sequences in order to try to find genes.
- The task is essentially to parse the sequence into intron, exon and noncoding parts.
- The Viterbi algorithm (discussed more later in the course) essentially looks for the maximum likelihood alignment of the test sequence with the states of the model, using a dynamic programming strategy. It also calculates the probability of the model producing the data via that path of states.
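A minimal Viterbi sketch on a toy two-state model. The states ("E" for exon-like, "N" for noncoding) and all the probabilities below are illustrative assumptions, not VEIL's actual sub-models:

```python
import math

# Viterbi: dynamic programming over states, keeping for each position i
# and state s the log probability of the best path ending in s, plus the
# path itself. Returns the maximum likelihood path and its log probability.
def viterbi(obs, states, start_p, trans_p, emit_p):
    V = [{s: (math.log(start_p[s] * emit_p[s][obs[0]]), [s]) for s in states}]
    for o in obs[1:]:
        prev, cur = V[-1], {}
        for s in states:
            # Best predecessor state r for state s at this position.
            r = max(states, key=lambda r: prev[r][0] + math.log(trans_p[r][s]))
            lp = prev[r][0] + math.log(trans_p[r][s] * emit_p[s][o])
            cur[s] = (lp, prev[r][1] + [s])
        V.append(cur)
    best = max(states, key=lambda s: V[-1][s][0])
    return V[-1][best][1], V[-1][best][0]

# Toy model: GC-rich emissions in the exon-like state E, AT-rich in N.
states = ("E", "N")
start = {"E": 0.5, "N": 0.5}
trans = {"E": {"E": 0.9, "N": 0.1}, "N": {"E": 0.1, "N": 0.9}}
emit = {"E": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "N": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}
path, logp = viterbi("GCGC", states, start, trans, emit)
```

Reading off the state path is exactly the parse into exon-like and noncoding stretches described above.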
24 Higher order Markov chains
- Suppose we want to relax the assumption that the state at position i+1 depends only on the state at position i, and instead let it depend on the n previous states (an nth order Markov chain).
- This can lead to problems with the training algorithms, but we can employ a trick and let the states of the HMM correspond to n-tuples of nucleotides, since we then get a first order Markov chain between the n-tuples.
25 Higher order Markov chains
- E.g. suppose we have a sequence of As and Bs in which each letter depends on the two previous letters; then we can equivalently consider a first order Markov chain in (overlapping) pairs of letters.
- Higher order Markov chains may be more appropriate for modelling DNA sequences.
[Figure: transition diagram between the four pair states AA, AB, BA and BB]
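The trick can be sketched in a few lines: rewrite the sequence as its overlapping pairs, after which estimating a first order chain over the pair states is routine. The function names are illustrative:

```python
from collections import Counter

# Rewrite a sequence over {A, B} as its overlapping pairs, turning a
# second order chain over letters into a first order chain over pairs.
def to_pair_states(seq):
    return [seq[i:i + 2] for i in range(len(seq) - 1)]

# Count first order transitions between consecutive pair states. Note
# that consecutive pairs must overlap in one letter, so only half of the
# conceivable pair-to-pair transitions can ever occur.
def pair_transitions(seq):
    pairs = to_pair_states(seq)
    return Counter(zip(pairs, pairs[1:]))
```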
26 Testing: Sensitivity and Specificity
- These are measures of the predictive power of the model (HMM).
- They are used to compare different gene finding programs.
- Sensitivity = TP/(TP+FN), Specificity = TP/(TP+FP) (where T/F = true/false, e.g. T denotes the nucleotide really is in an exon; P/N = positive/negative, e.g. P denotes the nucleotide is predicted to be in an exon).
- These test criteria are different from the objective function used to train the HMM (the objective function is simply the probability of the data given the model); this means there is no guarantee of even local optimality.
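These definitions are computed per nucleotide from the predicted and true labels. A small sketch (the function name is mine); note that what the gene finding literature calls specificity, TP/(TP+FP), is what the classification literature usually calls precision:

```python
# Sensitivity and specificity from per-nucleotide labels, where True
# means "in exon". Sensitivity = TP/(TP+FN), specificity = TP/(TP+FP).
def sensitivity_specificity(true_labels, pred_labels):
    tp = sum(t and p for t, p in zip(true_labels, pred_labels))
    fn = sum(t and not p for t, p in zip(true_labels, pred_labels))
    fp = sum(p and not t for t, p in zip(true_labels, pred_labels))
    return tp / (tp + fn), tp / (tp + fp)
```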
27 Training and Testing VEIL
- The final accuracy of a gene finder depends heavily on the quality and quantity of the data used to train it.
- The authors used a set of 570 vertebrate sequences from different species.
- The data had been filtered for quality control reasons: pseudogenes and various other non-standard genes, or genes which would bias the data, had been removed.
- They performed 5-fold cross-validation, both with and without removal of highly homologous genes from the test sets.
28 Training and testing VEIL
- The testing involved calculating the sensitivity and specificity of both nucleotide labelling (coding/noncoding) and exons exactly found. Other measures of accuracy were also used.
- The nucleotide sensitivity was roughly 80%, the specificity c. 70%.
- Getting both ends of an exon right is a much harder test, and the figures were sensitivity c. 50%, specificity c. 50%.
- The authors note that they can improve these statistics by combining this de novo gene finding method (HMMs) with database search (homology) methods.
29 Conclusions on gene finding
- With so much DNA sequence information available, it is important to find out how to make sense of it: to find the functional bits, which are the genes.
- This can be achieved using ESTs, but there are problems with this method.
- Alternative methods described in this lecture are (i) database homology searches to find known genes similar to the query sequence, and (ii) de novo pattern finding methods such as HMMs.
- David Jones will discuss this topic in more detail.