Gene prediction - PowerPoint PPT Presentation

1 / 29
About This Presentation
Title:

Gene prediction

Description:

Up to 99% DNA is non-coding- long ORFs can exist by chance. ... Located regions are small compared to entire matrix, so the algorithm is speeded up a lot. ... – PowerPoint PPT presentation

Number of Views:18
Avg rating:3.0/5.0
Slides: 30
Provided by: mathUn7
Category:

less

Transcript and Presenter's Notes

Title: Gene prediction


1
Gene prediction
2
Finding eukaryotic genes
  • Up to 99 DNA is non-coding- long ORFs can exist
    by chance.
  • Introns break up ORFs into smaller regions making
    them harder to spot.
  • Introns contain information for their own removal
    (splicing) but these sequences may not be well
    enough
    conserved to make
    prediction of these
    accurate.

3
Gene finding using homology
  • Extended homology (similarity of sequence) with a
    known gene ? likely to be gene.
  • Homology searches can also look for short regions
    of sequence that are conserved between many
    different genes
  • Several of these protein motifs encoded by a
    small region of DNA could indicate the presence
    of a gene.

4
Multiple alignment for gene prediction
  • Homology detection involves sequence alignment by
    maximizing a similarity score function.
  • This can be achieved using the dynamic
    programming algorithm

5
Dynamic programming
  • Alignment- work left to right in both sequences
  • At each point can have 1) a match of two letters,
    2) a gap in the first sequence matched with a
    letter in the second or 3) a letter in the first
    with a gap in the second
  • Eg. Align AIMS with AMOS
  • AIM-S
  • A-MOS

6
Dynamic programming
  • Path matrix Put one sequence on vertical axis
    and one on horizontal.
  • Alignment finding best path top-left to
    bottom-right

A I M S
  • Gap in vertical sequence horizontal line
  • Gap in horizontal sequence vertical line
  • Match of two letters diagonal line

7
Dynamic programming
  • Search tree

Dynamic programming algorithm for searching tree
for best path only want to find single best
path- black circles
actually same point, so no need to search through
all paths. Difference in score between these
paths are just the differences in score of path
up to that point, so pick the best and prune of
other paths from black circle.
8
Global alignment
  • Score function to be optimised is the sum of
    weights at each position of the alignment.
  • Weights are defined by a substitution matrix
    among the four nucleotides and also by a gap
    penalty- simple example fixed value for matches
    and mismatches.
  • So we can define a score matrix with a value at
    each point in the path matrix
  • Where s(i) is the letter in ith position in
    horizontal sequence, t(j) is jth in vertical
    sequence, ws,t is the weight for substituting
    letter s for t and d is the gap penalty.

9
Global alignment
  • The initial values of the score matrix are
  • Dynamic programming is more efficient than many
    standard tree search algorithms, because most
    branches pruned according to the score function.

10
Local alignment
  • Global alignment works when you know where to
    start and end, which is not usually the case for
    gene finding.
  • Look instead for short sections of similarity
    (eg. motifs).
  • For this we must define the score matrix slightly
    differently to allow the alignment to start
    anywhere

11
Database searches
  • When trying to find homology information on a
    sequence of interest, you try to align it with
    thousands of sequences of known genes from a
    database.
  • The dynamic programming algorithm which is O(nm)
    takes a huge number of computations to search the
    database for homology
  • Recall n, m lengths of target and source
    search strings
  • It can be made efficient by parallel processing,
    but faster less rigorous algorithms are often
    employed instead.

12
FASTA and BLAST
  • FASTA and BLAST are approximate, but more
    efficient.
  • FASTA- do dot matrix plot of two sequences-
    black dots where the nucleotides in vertical
    sequence are the same as those in the horizontal.
  • By visual inspection locate regions of similarity
    which show up as diagonal lines (or, if you allow
    for some small gaps, clusters of diagonal
    stretches)

13
FASTA
  • Apply dynamic programming to these located
    regions only.
  • Located regions are small compared to entire
    matrix, so the algorithm is speeded up a lot.
  • Initial search for diagonals is done by hashing-
    which is actually usually done on sextuples of
    nucleotides, instead of single ones.

14
BLAST
  • BLAST algorithm uses an alternative approach to
    sequence alignment.
  • Query sequence is split up into words, usually
    11 bases long in DNA sequence or 3 amino acids
    long in protein sequence.
  • These words are then compared to a list of all
    possible words and a new list is generated,
    containing all words that match at what is
    considered a significant level. (NB some words
    are so common that even a perfect match is not
    considered significant.)

15
BLAST
  • A finite state automaton can then be used to find
    any exact matches of these words within a
    sequence database efficiently
  • The regions of both target and query sequence
    where words are found to have matched are then
    examined further. This is achieved by extending
    the alignment locally from either side of the
    word until a maximum score is achieved.

16
Gene finding with Hidden Markov Models
17
Modelling sequences as Hidden Markov processes
  • Think of the nucleotide sequence in a piece of
    DNA as being a sequence of outputs of a process
    that progresses through a series of discrete
    states, some of which are hidden to the observer.
  • Markov assumption is that the state of the system
    at position i1 in the sequence depends only on
    the state at position i and not on other
    positions (usually we will number the sequence in
    the direction of transcription).
  • The Markov assumption is of course not true of
    DNA and there are various ways of minimizing the
    impact of this wrong assumption.

18
An example gene finder VEIL
  • The Viterbi Exon-Intron Locator (VEIL) was
    developed by John Henderson, Steven Salzberg, and
    Ken Fasman at Johns Hopkins University.
  • Gene finder with a modular structure
  • Uses a HMM which is made up of sub-HMMs each to
    describe a different bit of the sequence
    upstream noncoding DNA, exon, intron,
  • Assumes test data starts and ends with noncoding
    DNA and contains exactly one gene.
  • Uses biological knowledge to hardwire part of
    HMM, eg. start stop codons, splice sites.

19
The exon sub-model
20
Other submodels
  • The start codon model is very simple
  • The splice junctions are also quite simple and
    can be hardwired (here is the 5 splice site)

Upstream
Exon
a
t
g
21
The overall model
Start codon
Stop codon
Downstream
Upstream
Exon
3 splice site
5 polyA site
5 splice site
intron
For more details, see J. Henderson, S.L.
Salzberg, and K. Fasman (1997) Journal of
Computational Biology 42, 127-141.
22
Training
  • The (sub-)HMMs can be trained by using the
    Expectation Maximization algorithm.
  • The idea is as follows
  • You estimate the transition probabilities
    (parameters) of the HMMs by picking those which
    maximise the probability of obtaining the data.
  • where are the transition
    probabilities between states and is
    the training set of observed sequences.
  • Guaranteed to give locally optimal solution.

23
Analysing new sequences using the trained HMM
  • Once the transition probabilities have been
    estimated, the HMM can be fed novel sequences in
    order to try to find genes.
  • The task is essentially to parse the sequence
    into intron, exon and noncoding parts.
  • The Viterbi algorithm (discussed more later in
    course) essentially looks for the maximum
    likelihood alignment of the test sequence with
    the states of the model, using a dynamic
    programming strategy. It also calculates the
    probability of the model producing the data via
    that path of states.

24
Higher order Markov chains
  • Suppose we want to relax the assumption that the
    state at position i1 depends only on the state
    at position i and instead depends on the n
    previous states (an nth order Markov chain)
  • This can lead to problems with the training
    algorithms, but we can employ a trick and let
    states of the HMM correspond to n-tuples of
    nucleotides, since
  • We in fact get a first order Markov chain between
    the n-tuples

25
Higher order Markov chains
  • Eg. suppose we have a sequence of As and Bs
    in which each letter depends on the two previous
    letters, then we can equivalently consider a
    first order Markov chain in (overlapping) pairs
    of letters
  • Higher order Markov chains may be more
    appropriate for modelling DNA sequences?

AB
AA
BA
BB
26
Testing-Sensitivity and Specificity
  • These are measures of the predictive power of the
    model (HMM).
  • They are used to compare different gene finding
    programs.
  • SensitivityTP/(TPFN) SpecificityTP/(TPFP)
    (where T/Ftrue/false, eg. T denotes nucleotide
    really is in exon P/Npositive/negative, eg. P
    denotes nucleotide is predicted to be in exon.)
  • These test criteria are different from the
    objective functions used to train the HMM
    (objective function is simply the probability of
    data given the model)- this means there is no
    guarantee of even local optimality.

27
Training and Testing VEIL
  • Final accuracy of gene finder depends heavily on
    the quality and quantity of the data used to
    train it.
  • They used a set of 570 vertebrate sequences from
    different species.
  • The data had been filtered for quality control
    reasons- pseudogenes and various other
    non-standard genes or genes which would bias the
    data had been removed.
  • They performed 5-fold cross-validation- both with
    and without removal of highly homologous genes
    from the test sets.

28
Training and testing VEIL
  • The testing involved calculating sensitivity and
    specificity of both nucleotide labelling (coding/
    noncoding) and exons exactly found. Other
    measures of accuracy were also used.
  • The nucleotide sensitivity was roughly 80,
    specificity c. 70
  • Getting both ends of the exon right is a much
    harder test and the figures were sensitivity c.
    50, specificity c. 50.
  • The authors note that they can improve these
    statistics by combining this de novo gene
    finding method (HMMs) with database search
    (homology) methods.

29
Conclusions on gene finding
  • With so much DNA sequence information available,
    it is important to find out how to make sense of
    it- to find the functional bits, which are the
    genes
  • This can be achieved using ESTs, but there are
    problems with this method.
  • Alternative methods described in this lecture are
    i) database homology searches to find known genes
    similar to the query sequence and ii) de novo
    pattern finding methods such as HMMs
  • David Jones will discuss this topic in more
    detail.
Write a Comment
User Comments (0)
About PowerShow.com