Gene Finding 1: Exons
1
Gene Finding 1: Exons
  • http://compbio.uchsc.edu/hunter/bioi7711

2
Annotation of Genomic Sequence
  • Given the sequence of a genome, we would like to
    be able to identify
  • Genes
  • Exon boundaries / splice sites
  • Beginning and end of translation
  • Alternative splicings
  • Regulatory elements (e.g. promoters)
  • The only certain way to do this is
    experimentally, but computational methods can
    achieve reasonable accuracy quickly and help
    direct experimental approaches.

3
Eukaryotic gene structure
4
Gene Prediction
  • There is no perfect method (yet known) for
    finding genes. All approaches rely on combining
    various weak signals
  • Find elements of a gene
  • coding sequences (exons)
  • promoters and start signals
  • poly-A tails and downstream signals
  • Assemble into a consistent gene model
  • Use of homologous sequences

5
Exon Finding
  • The essence of any gene annotation scheme for
    Eukaryotic organisms is exon finding: the
    simplest task, and a vital precondition.
  • Although information from up- and downstream
    regions can help, most exon finding approaches
    try to discriminate between exon and non-exon
    sequence.
  • Discrimination problems are widely studied in
    statistical inference and machine learning.

6
ORFs
  • ORF = Open Reading Frame: a region of sequence
    that has the potential to be translated into a
    protein.
  • Six possible reading frames (3 starts x 2
    strands)
  • Starts with an atg (Met)
  • Ends with a stop codon (taa, tag or tga)
  • No intervening stop codons
  • Not all ORFs are exons, but all exons must be in
    an ORF (a minimal ORF scan is sketched below).
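
A minimal sketch of such an ORF scan, in Python (the
function names and the toy sequence are illustrative,
not from any published tool):

from typing import List, Tuple

STOPS = {"taa", "tag", "tga"}

def revcomp(seq: str) -> str:
    """Reverse complement of a DNA string."""
    comp = {"a": "t", "t": "a", "g": "c", "c": "g"}
    return "".join(comp[b] for b in reversed(seq))

def orfs_in_frame(seq: str, frame: int) -> List[Tuple[int, int]]:
    """(start, end) offsets of ORFs in one reading frame."""
    found, start = [], None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "atg":
            start = i                      # open an ORF at the first atg
        elif start is not None and codon in STOPS:
            found.append((start, i + 3))   # close at the first in-frame stop
            start = None
    return found

def find_orfs(seq: str) -> List[Tuple[str, int, Tuple[int, int]]]:
    """All ORFs across the six reading frames (3 frames x 2 strands)."""
    seq = seq.lower()
    results = []
    for strand, s in (("+", seq), ("-", revcomp(seq))):
        for frame in range(3):
            for orf in orfs_in_frame(s, frame):
                results.append((strand, frame, orf))
    return results

print(find_orfs("ccatgaaatttgggtaacc"))  # one ORF: atg aaa ttt ggg taa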

7
How to identify ORFs that are exons?
  • The observed length distribution is non-random
  • Long ORFs are more likely to be exons
  • A 12-part mixture model of length distributions
    has been proposed
  • Signatures of exons
  • CpG islands
  • intron splice junctions
  • (hexa)nucleotide frequencies
  • Signatures of non-exons (repeats, Alus, etc.)
  • Compatibility with surrounding sequence / other
    exons (e.g. a consistent reading frame)

8
CpG Islands
  • CpG islands are regions of sequence that have a
    high proportion of CG dinucleotide pairs (p is
    the phosphodiester bond linking them)
  • CpG islands are present in the promoter and
    exonic regions of approximately 40% of mammalian
    genes
  • Other regions of the mammalian genome contain few
    CpG dinucleotides, and these are largely
    methylated
  • Definition (computed in the sketch below):
    sequences of >500 bp with
  • GC content ≥ 55%
  • Observed(CpG)/Expected(CpG) ≥ 0.65
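
The two statistics in the definition can be computed
directly; a small sketch (assuming the usual
expected-CpG formula, count(C) * count(G) / length):

def cpg_stats(seq: str) -> tuple[float, float]:
    """Return (GC fraction, observed/expected CpG) for a window."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    observed = seq.count("CG")        # observed CpG dinucleotides
    expected = c * g / n              # expected count if C and G independent
    gc_frac = (c + g) / n
    return gc_frac, (observed / expected if expected else 0.0)

def is_cpg_island(seq: str) -> bool:
    """Apply the definition above: >500 bp, GC >= 55%, obs/exp >= 0.65."""
    gc, ratio = cpg_stats(seq)
    return len(seq) > 500 and gc >= 0.55 and ratio >= 0.65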

9
Splice junctions
  • Most Eukaryotic introns have a consensus splice
    signal: GU at the beginning (donor), AG at the
    end (acceptor); GT/AG in the DNA sense strand.
  • Variation does occur in the splice sites
  • Many AGs and GTs are not splice sites (see the
    naive scan sketched below).
  • Database of experimentally validated human splice
    sites: http://www.ebi.ac.uk/thanaraj/splice.html
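
To see why so many GTs and AGs are not splice sites,
here is a naive scan that pairs every GT with every
downstream AG (illustrative only; real predictors
score full donor/acceptor motifs):

import re

def candidate_introns(seq: str, min_len: int = 60) -> list[tuple[int, int]]:
    """All (donor, acceptor) positions where a GT...AG intron could sit."""
    seq = seq.upper()
    donors = [m.start() for m in re.finditer("GT", seq)]
    acceptors = [m.start() for m in re.finditer("AG", seq)]
    # The candidate count grows roughly quadratically with sequence
    # length -- almost all of these pairs are false positives.
    return [(d, a) for d in donors for a in acceptors if a - d >= min_len]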

10
Hexanucleotide frequencies
  • Amino acid distributions are biased, e.g. p(A) >
    p(C)
  • Pairwise distributions are also biased, e.g.
    p(AT)/(p(A)p(T)) > p(AC)/(p(A)p(C))
  • Nucleotides that code for preferred amino acids
    (and AA pairs) occur more frequently in coding
    regions than in non-coding regions.
  • Codon biases (per amino acid)
  • Hexanucleotide distributions that reflect those
    biases indicate coding regions (a log-odds scorer
    is sketched below).
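
One way to turn hexamer frequencies into a score is a
log-odds sum over a window; a sketch, assuming
frequency tables estimated from known coding and
non-coding training sequences:

import math
from collections import Counter

def hexamer_freqs(training_seqs: list[str]) -> dict[str, float]:
    """Relative frequency of every overlapping 6-mer in a training set."""
    counts = Counter()
    for s in training_seqs:
        s = s.upper()
        for i in range(len(s) - 5):
            counts[s[i:i + 6]] += 1
    total = sum(counts.values())
    return {hexamer: n / total for hexamer, n in counts.items()}

def hexamer_score(seq: str, coding: dict, noncoding: dict,
                  floor: float = 1e-6) -> float:
    """Sum of log(p_coding / p_noncoding) over the window's hexamers."""
    seq = seq.upper()
    score = 0.0
    for i in range(len(seq) - 5):
        h = seq[i:i + 6]
        score += math.log(coding.get(h, floor) / noncoding.get(h, floor))
    return score   # > 0 suggests coding-like composition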

11
Issues in 6mer frequency
  • Sliding window across all 6 reading frames.
  • Significance of a score?
  • In order to get good statistics on hexamer
    frequencies in an ORF, it has to be long
  • Amino acid dimer (and nucleotide hexamer)
    frequencies vary by organism.
  • Using frequencies from one organism (or a
    consensus) for another gives a noisier signal yet.

12
General challenges
  • Short exons are hard to find, but not uncommon.
    The shortest human exon is 3 bp!
  • First and last exons are particularly difficult
  • No bounding intron, so no splice site signal
    (although start and end codons are related
    signals)
  • Generally contain non-coding sequence as well as
    coding sequence, so hexamer signals are diluted.
  • Alternative splicing means that there are
    multiple true solutions for some genes.

13
Probabilistic framework
  • We can express all of these weak signals as
    probabilities
  • P(exon | length)
  • P(exon | hexamer composition)
  • P(exon | CpG content)
  • P(exon | adjacent splice signals)
  • etc., etc.
  • How to combine them?
  • Need to know the dependency structure!
  • Empirical versus theoretical approaches (an
    independence-assuming combination is sketched
    below)
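
The simplest (and usually wrong) dependency structure
is full independence, which reduces combination to a
sum of log-likelihood ratios; a sketch with
hypothetical signal names and likelihoods:

import math

def combined_log_odds(signals: dict[str, tuple[float, float]]) -> float:
    """signals: name -> (P(signal | exon), P(signal | not exon)).

    Assuming independence (and equal priors), the combined log odds
    is the sum of the per-signal log-likelihood ratios.
    """
    return sum(math.log(p_exon / p_bg) for p_exon, p_bg in signals.values())

score = combined_log_odds({
    "length": (0.30, 0.10),          # hypothetical likelihoods
    "hexamers": (0.60, 0.20),
    "cpg": (0.40, 0.30),
    "splice_signals": (0.50, 0.10),
})
print(score > 0)   # True: the evidence favours the exon hypothesis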

14
Inductive Learning
  • The inference of general rules from a set of
    examples: a classic problem in CS, statistics...
  • Training examples
  • Must be representative (random is best) and
    labeled
  • Need an adequate number
  • Sometimes benefit from positive and negative
    instances
  • Representation (what aspects of examples?)
  • Kind of rule to be induced (e.g. linear)
  • Algorithm for induction...

15
Representation
  • The most important aspect of inductive learning
  • Most common: a fixed-length feature vector.
  • Feature: an observable value related to the task
  • Fixed-length vector: a particular list of
    features which has the same meaning from example
    to example
  • Some feature sets (like sequences) have variable
    length or can have variable meaning
  • Can translate via a sliding window
  • Need all the relevant features and not too many
    irrelevant ones (for the amount of data)

16
HEXON/FEX
  • An early, simple approach: linear discriminant
    analysis (LDA).
  • Training set (exons and non-exons)
  • Set of features (for internal exons)
  • donor/acceptor site recognizers
  • octanucleotide preferences for the coding region
  • octanucleotide preferences for intron interiors
    on either side
  • Rule to be induced: a linear combination of
    features.
  • Moderately accurate

17
Linear Discriminant Analysis
  • Imagine we set a pseudovariable Y to be 1 when
    an example is an exon, and -1 otherwise
  • Let the F features in the vector be Xi, i ∈ F
  • Do multiple regression over the examples to
    calculate the least squares fit of the
    coefficients a and ci to Y = a + Σi ciXi
  • For new examples to be tested, if Y > 0, then
    call it an exon (see the sketch below).
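
A sketch of this regression-as-discriminant idea
(assumed data shapes; not the HEXON/FEX code itself):

import numpy as np

def fit_linear_discriminant(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Least-squares fit of Y = a + sum_i c_i X_i.

    X: (n_examples, n_features); y: +1 for exons, -1 otherwise.
    Returns the coefficient vector [a, c_1, ..., c_F].
    """
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])   # intercept column for a
    coeffs, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coeffs

def predict_exon(coeffs: np.ndarray, x: np.ndarray) -> bool:
    """Call an example an exon when the fitted Y is positive."""
    return float(coeffs[0] + coeffs[1:] @ x) > 0.0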


18
Linear discrimination picture
19
HMM for exons
  • Large, internal exons only
  • Training sequences
  • Initial training on 100 nucleotides of intron
  • Followed by 100-200 (avg 142) nucleotides of exon
  • Followed by 100 nucleotides of intron.
  • Found a compositional periodicity of size 10,
    apparently due to positioning on the nucleosome.

20
Exon HMM periodicity
  • Donor and acceptor are clear, as is the G-rich
    region near the donor

A 10-state circular HMM's probability is nearly as
high as the linear model's!
21
GRAIL
  • A similar approach to FEX, but with a different
    feature list, and a neural network to combine the
    features.
  • First, uses about 30 manually derived rules to
    exclude impossible candidates (95%!)
  • Then a neural network with these features
  • Coding probability (from hexamers)
  • GC composition
  • length
  • splice site signal strength measure
  • intron score for adjacent regions

22
Neural Networks
  • Method for inducing non-linear predictive
    combinations from examples
  • Metaphor to real neurons.

[Figure: a small network of nodes with example
activation values and arc weights]
23
How do NN's work?
  • Each node has a value, and each arc a weight
  • Values of input nodes are set by each example
  • Non-input nodes are the weighted sum of their
    inputs put through a "squashing" function

n4 = f(n1w1 + n2w2 + n3w3)

[Figure: node n4 receiving inputs n1, n2, n3 over
weights w1, w2, w3]
24
Using a NN for prediction
  • Each layer gets inputs only from the layer below
    it, and gives output only to the layer above it:
    a feed-forward topology.
  • Given the weights and input values, start with
    the inputs and work forward to an output.
  • The output node can be normalized to 0-1 for a
    probability
  • The input to each node is a linear combination,
    but its output is non-linear.
  • A combination of non-linear combinations can
    approximate a very large class of functions (a
    forward pass is sketched below).
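
A sketch of the forward pass with a logistic
squashing function (layer sizes and weights here are
arbitrary, not GRAIL's network):

import math
import random

def squash(x: float) -> float:
    """Logistic squashing function; output lies in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def layer_forward(values: list[float],
                  weights: list[list[float]]) -> list[float]:
    """Each node: squash(weighted sum of the layer below)."""
    return [squash(sum(v * w for v, w in zip(values, ws))) for ws in weights]

def feed_forward(inputs: list[float],
                 layers: list[list[list[float]]]) -> float:
    """Work forward from the inputs; the output can act as P(exon)."""
    values = inputs
    for weights in layers:
        values = layer_forward(values, weights)
    return values[0]

# Hypothetical 5-input, 3-hidden, 1-output network with random weights.
random.seed(0)
hidden = [[random.uniform(-1, 1) for _ in range(5)] for _ in range(3)]
output = [[random.uniform(-1, 1) for _ in range(3)]]
print(feed_forward([0.8, 0.3, 0.5, 0.1, 0.9], [hidden, output]))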

25
NN discrimination picture
[Figure: scatter of exon (e) and intron (i) examples
plotted by Feature 1 vs. Feature 2, separated by a
curved decision boundary]
26
Where do weights come from?
  • Have training examples with known outputs.
  • Use the "backpropagation" algorithm
  • Start with small random weights
  • For each example, use feed forward to calculate
    the predicted outcome.
  • Calculate the error (difference between predicted
    and actual outcome) and change the weights to
    reduce the amount of error.
  • Repeat.

27
How to change the weights
  • Calculate an error term for each node
  • For the output node, the error term e = target
    (t) - output (o)
  • Let the derivative of the squashing function be
    f'. Define the error signal to be f'(sum of
    inputs) * e
  • For non-output nodes, the error term is the sum
    of all of the error signals in the layer above,
    multiplied by the weights connecting them to the
    node.
  • Δwi = error signal * ni * L, where ni is the
    input on that weight and L is the learning rate
    (see the sketch below)
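
A sketch of one delta-rule step for a single logistic
output node (using the fact that the logistic's
derivative is f'(x) = f(x)(1 - f(x)); the names are
illustrative):

import math

def squash(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def update_output_weights(inputs: list[float], weights: list[float],
                          target: float, rate: float) -> list[float]:
    """One step of: e = t - o; signal = f'(sum) * e; dw_i = signal * n_i * L."""
    total = sum(n * w for n, w in zip(inputs, weights))
    o = squash(total)
    e = target - o                 # error term for the output node
    signal = o * (1.0 - o) * e     # f'(sum) * e, via the logistic derivative
    return [w + rate * signal * n  # dw_i = error signal * n_i * L
            for n, w in zip(inputs, weights)]

print(update_output_weights([0.5, 0.9, 0.1], [0.1, -0.2, 0.3],
                            target=1.0, rate=0.5))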

28
Weight change picture
[Figure: inputs n1, n2, n3 feeding a node over
weights w1, w2, w3]

Δw = f'(sum) * e * ni * L
29
Neural Network Training Summary
  • Calculating the partial derivative of the total
    error with respect to each weight, and moving the
    weight a little bit in the direction that reduces
    error
  • A minimization in error space.
  • Generally, convergence to a local minimum in
    training error space gives good performance.
  • But...

30
Overfitting!
  • If there are too many weights for the number of
    training examples, or training goes on too long,
    generalization performance will be poor.

31
Overfitting picture
[Figure: the same exon/intron scatter over Feature 1
vs. Feature 2, with an overly convoluted boundary
threading around individual training points]
32
More on Neural Networks
  • Neural Network FAQ (programs, tutorials, etc.):
    ftp://ftp.sas.com/pub/neural/FAQ.html
  • Different squashing functions give different
    generalization, e.g. radial basis functions.
  • Can add a "momentum" term to the minimization to
    avoid oscillations
  • Other minimization techniques (e.g. conjugate
    gradient descent) can give faster / better
    convergence and eliminate the rate parameter

33
MZEF
  • Michael Zhang's Exon Finder
  • Uses a more nuanced view of exons (12 different
    categories)
  • Fancier models of the end regions (e.g. splice
    sites)
  • Quadratic Discriminant Analysis (QDA; sketched
    below)
  • Similar to LDA, but a more complex function
    allows for curved discrimination surfaces
  • Better accuracy than either LDA or NNs in a
    third-party comparison.
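
A sketch of two-class QDA in its standard Gaussian
form (one Gaussian per class; this is the generic
method, not MZEF's implementation):

import numpy as np

class QDA:
    """Fit one Gaussian per class; classify by the larger log-density."""

    def fit(self, X: np.ndarray, y: np.ndarray) -> "QDA":
        # X: (n_examples, n_features); y: 1 for exons, 0 otherwise.
        self.params = {}
        for label in (0, 1):
            Xc = X[y == label]
            self.params[label] = (
                Xc.mean(axis=0),
                np.cov(Xc, rowvar=False),   # class-specific covariance
                np.log(len(Xc) / len(X)),   # log prior
            )
        return self

    def _log_density(self, x, mean, cov, log_prior):
        d = x - mean
        # Quadratic in x: this is what curves the decision surface.
        return log_prior - 0.5 * (d @ np.linalg.inv(cov) @ d
                                  + np.log(np.linalg.det(cov)))

    def predict(self, x: np.ndarray) -> int:
        scores = {c: self._log_density(x, *p) for c, p in self.params.items()}
        return max(scores, key=scores.get)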

34
QDA picture
[Figure: the exon/intron scatter over Feature 1 vs.
Feature 2, separated by a smooth quadratic decision
boundary]
35
Readings
  • Mount, ch. 8, pp. 337-357. Most of this section
    describes Prokaryotic gene identification, which
    is a much easier task. Note that an HMM achieves
    very good accuracy at it.
  • Michael Zhang, "Computational Prediction of
    Eukaryotic Protein-Coding Genes." Nature Reviews
    Genetics 3: 698-709 (2002).