Gene prediction

About This Presentation

Title:

Gene prediction

Description:

Up to 99% DNA is non-coding- long ORFs can exist by chance. ... Located regions are small compared to entire matrix, so the algorithm is speeded up a lot. ... – PowerPoint PPT presentation

Number of Views:18

Avg rating:3.0/5.0

Slides: 30

Provided by: mathUn7

Category:

more less

Transcript and Presenter's Notes

Title: Gene prediction

1
Gene prediction
2
Finding eukaryotic genes

Up to 99 DNA is non-coding- long ORFs can exist
by chance.
Introns break up ORFs into smaller regions making
them harder to spot.
Introns contain information for their own removal
(splicing) but these sequences may not be well
enough
conserved to make
prediction of these
accurate.

3
Gene finding using homology

Extended homology (similarity of sequence) with a
known gene ? likely to be gene.
Homology searches can also look for short regions
of sequence that are conserved between many
different genes
Several of these protein motifs encoded by a
small region of DNA could indicate the presence
of a gene.

4
Multiple alignment for gene prediction

Homology detection involves sequence alignment by
maximizing a similarity score function.
This can be achieved using the dynamic
programming algorithm

5
Dynamic programming

Alignment- work left to right in both sequences
At each point can have 1) a match of two letters,
2) a gap in the first sequence matched with a
letter in the second or 3) a letter in the first
with a gap in the second
Eg. Align AIMS with AMOS
AIM-S
A-MOS

6
Dynamic programming

Path matrix Put one sequence on vertical axis
and one on horizontal.
Alignment finding best path top-left to
bottom-right

A I M S

Gap in vertical sequence horizontal line
Gap in horizontal sequence vertical line
Match of two letters diagonal line

7
Dynamic programming

Search tree

Dynamic programming algorithm for searching tree
for best path only want to find single best
path- black circles
actually same point, so no need to search through
all paths. Difference in score between these
paths are just the differences in score of path
up to that point, so pick the best and prune of
other paths from black circle.
8
Global alignment

Score function to be optimised is the sum of
weights at each position of the alignment.
Weights are defined by a substitution matrix
among the four nucleotides and also by a gap
penalty- simple example fixed value for matches
and mismatches.
So we can define a score matrix with a value at
each point in the path matrix
Where s(i) is the letter in ith position in
horizontal sequence, t(j) is jth in vertical
sequence, ws,t is the weight for substituting
letter s for t and d is the gap penalty.

9
Global alignment

The initial values of the score matrix are
Dynamic programming is more efficient than many
standard tree search algorithms, because most
branches pruned according to the score function.

10
Local alignment

Global alignment works when you know where to
start and end, which is not usually the case for
gene finding.
Look instead for short sections of similarity
(eg. motifs).
For this we must define the score matrix slightly
differently to allow the alignment to start
anywhere

11
Database searches

When trying to find homology information on a
sequence of interest, you try to align it with
thousands of sequences of known genes from a
database.
The dynamic programming algorithm which is O(nm)
takes a huge number of computations to search the
database for homology
Recall n, m lengths of target and source
search strings
It can be made efficient by parallel processing,
but faster less rigorous algorithms are often
employed instead.

12
FASTA and BLAST

FASTA and BLAST are approximate, but more
efficient.
FASTA- do dot matrix plot of two sequences-
black dots where the nucleotides in vertical
sequence are the same as those in the horizontal.
By visual inspection locate regions of similarity
which show up as diagonal lines (or, if you allow
for some small gaps, clusters of diagonal
stretches)

13
FASTA

Apply dynamic programming to these located
regions only.
Located regions are small compared to entire
matrix, so the algorithm is speeded up a lot.
Initial search for diagonals is done by hashing-
which is actually usually done on sextuples of
nucleotides, instead of single ones.

14
BLAST

BLAST algorithm uses an alternative approach to
sequence alignment.
Query sequence is split up into words, usually
11 bases long in DNA sequence or 3 amino acids
long in protein sequence.
These words are then compared to a list of all
possible words and a new list is generated,
containing all words that match at what is
considered a significant level. (NB some words
are so common that even a perfect match is not
considered significant.)

15
BLAST

A finite state automaton can then be used to find
any exact matches of these words within a
sequence database efficiently
The regions of both target and query sequence
where words are found to have matched are then
examined further. This is achieved by extending
the alignment locally from either side of the
word until a maximum score is achieved.

16
Gene finding with Hidden Markov Models
17
Modelling sequences as Hidden Markov processes

Think of the nucleotide sequence in a piece of
DNA as being a sequence of outputs of a process
that progresses through a series of discrete
states, some of which are hidden to the observer.
Markov assumption is that the state of the system
at position i1 in the sequence depends only on
the state at position i and not on other
positions (usually we will number the sequence in
the direction of transcription).
The Markov assumption is of course not true of
DNA and there are various ways of minimizing the
impact of this wrong assumption.

18
An example gene finder VEIL

The Viterbi Exon-Intron Locator (VEIL) was
developed by John Henderson, Steven Salzberg, and
Ken Fasman at Johns Hopkins University.
Gene finder with a modular structure
Uses a HMM which is made up of sub-HMMs each to
describe a different bit of the sequence
upstream noncoding DNA, exon, intron,
Assumes test data starts and ends with noncoding
DNA and contains exactly one gene.
Uses biological knowledge to hardwire part of
HMM, eg. start stop codons, splice sites.

19
The exon sub-model
20
Other submodels

The start codon model is very simple
The splice junctions are also quite simple and
can be hardwired (here is the 5 splice site)

Upstream
Exon
a
t
g
21
The overall model
Start codon
Stop codon
Downstream
Upstream
Exon
3 splice site
5 polyA site
5 splice site
intron
For more details, see J. Henderson, S.L.
Salzberg, and K. Fasman (1997) Journal of
Computational Biology 42, 127-141.
22
Training

The (sub-)HMMs can be trained by using the
Expectation Maximization algorithm.
The idea is as follows
You estimate the transition probabilities
(parameters) of the HMMs by picking those which
maximise the probability of obtaining the data.
where are the transition
probabilities between states and is
the training set of observed sequences.
Guaranteed to give locally optimal solution.

23
Analysing new sequences using the trained HMM

Once the transition probabilities have been
estimated, the HMM can be fed novel sequences in
order to try to find genes.
The task is essentially to parse the sequence
into intron, exon and noncoding parts.
The Viterbi algorithm (discussed more later in
course) essentially looks for the maximum
likelihood alignment of the test sequence with
the states of the model, using a dynamic
programming strategy. It also calculates the
probability of the model producing the data via
that path of states.

24
Higher order Markov chains

Suppose we want to relax the assumption that the
state at position i1 depends only on the state
at position i and instead depends on the n
previous states (an nth order Markov chain)
This can lead to problems with the training
algorithms, but we can employ a trick and let
states of the HMM correspond to n-tuples of
nucleotides, since
We in fact get a first order Markov chain between
the n-tuples

25
Higher order Markov chains

Eg. suppose we have a sequence of As and Bs
in which each letter depends on the two previous
letters, then we can equivalently consider a
first order Markov chain in (overlapping) pairs
of letters
Higher order Markov chains may be more
appropriate for modelling DNA sequences?

AB
AA
BA
BB
26
Testing-Sensitivity and Specificity

These are measures of the predictive power of the
model (HMM).
They are used to compare different gene finding
programs.
SensitivityTP/(TPFN) SpecificityTP/(TPFP)
(where T/Ftrue/false, eg. T denotes nucleotide
really is in exon P/Npositive/negative, eg. P
denotes nucleotide is predicted to be in exon.)
These test criteria are different from the
objective functions used to train the HMM
(objective function is simply the probability of
data given the model)- this means there is no
guarantee of even local optimality.

27
Training and Testing VEIL

Final accuracy of gene finder depends heavily on
the quality and quantity of the data used to
train it.
They used a set of 570 vertebrate sequences from
different species.
The data had been filtered for quality control
reasons- pseudogenes and various other
non-standard genes or genes which would bias the
data had been removed.
They performed 5-fold cross-validation- both with
and without removal of highly homologous genes
from the test sets.

28
Training and testing VEIL

The testing involved calculating sensitivity and
specificity of both nucleotide labelling (coding/
noncoding) and exons exactly found. Other
measures of accuracy were also used.
The nucleotide sensitivity was roughly 80,
specificity c. 70
Getting both ends of the exon right is a much
harder test and the figures were sensitivity c.
50, specificity c. 50.
The authors note that they can improve these
statistics by combining this de novo gene
finding method (HMMs) with database search
(homology) methods.

29
Conclusions on gene finding

With so much DNA sequence information available,
it is important to find out how to make sense of
it- to find the functional bits, which are the
genes
This can be achieved using ESTs, but there are
problems with this method.
Alternative methods described in this lecture are
i) database homology searches to find known genes
similar to the query sequence and ii) de novo
pattern finding methods such as HMMs
David Jones will discuss this topic in more
detail.

Write a Comment

User Comments (0)

About PowerShow.com

Gene prediction - PowerPoint PPT Presentation

Gene prediction

Up to 99% DNA is non-coding- long ORFs can exist by chance. ... Located regions are small compared to entire matrix, so the algorithm is speeded up a lot. ... – PowerPoint PPT presentation