Gene Finding 1: Exons
1
Gene Finding 1: Exons
  • http://compbio.uchsc.edu/hunter/bioi7711

2
Annotation of Genomic Sequence
  • Given the sequence of a genome, we would like to
    be able to identify
  • Genes
  • Exon boundaries / splice sites
  • Beginning and end of translation
  • Alternative splicings
  • Regulatory elements (e.g. promoters)
  • The only certain way to do this is
    experimentally, but computational methods can
    achieve reasonable accuracy quickly and help
    direct experimental approaches.

3
Eukaryotic gene structure
4
Gene Prediction
  • There is no perfect method (yet known) for
    finding genes. All approaches rely on combining
    various weak signals
  • Find elements of a gene
  • coding sequences (exons)
  • promoters and start signals
  • poly-A tails and downstream signals
  • Assemble into a consistent gene model
  • Use of homologous sequences

5
Exon Finding
  • The essence of any gene annotation scheme for
    Eukaryotic organisms is exon finding: the
    simplest task, and a vital precondition.
  • Although information from up- and downstream
    regions can help, most exon finding approaches
    try to discriminate between exon and non-exon
    sequence.
  • Discrimination problems are widely studied in
    statistical inference and machine learning.

6
ORFs
  • ORF = Open Reading Frame: a region of sequence
    that has the potential to be translated into a
    protein.
  • Six possible reading frames (3 starts x 2
    strands)
  • Starts with an atg (Met)
  • Ends with a stop codon (taa, tag or tga)
  • No intervening stop codons
  • Not all ORFs are exons, but all exons must be in
    an ORF (a minimal ORF scan is sketched below).
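
A minimal sketch of such an ORF scan, in Python (the
function names and the toy sequence are illustrative,
not from any published tool):

from typing import List, Tuple

STOPS = {"taa", "tag", "tga"}

def revcomp(seq: str) -> str:
    """Reverse complement of a DNA string."""
    comp = {"a": "t", "t": "a", "g": "c", "c": "g"}
    return "".join(comp[b] for b in reversed(seq))

def orfs_in_frame(seq: str, frame: int) -> List[Tuple[int, int]]:
    """(start, end) offsets of ORFs in one reading frame."""
    found, start = [], None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "atg":
            start = i                      # open an ORF at the first atg
        elif start is not None and codon in STOPS:
            found.append((start, i + 3))   # close at the first in-frame stop
            start = None
    return found

def find_orfs(seq: str) -> List[Tuple[str, int, Tuple[int, int]]]:
    """All ORFs across the six reading frames (3 frames x 2 strands)."""
    seq = seq.lower()
    results = []
    for strand, s in (("+", seq), ("-", revcomp(seq))):
        for frame in range(3):
            for orf in orfs_in_frame(s, frame):
                results.append((strand, frame, orf))
    return results

print(find_orfs("ccatgaaatttgggtaacc"))  # one ORF: atg aaa ttt ggg taa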

7
How to identify ORFs that are exons?
  • The observed length distribution is non-random
  • Long ORFs are more likely to be exons
  • A 12-part mixture model of length distributions
    has been proposed
  • Signatures of exons
  • CpG islands
  • intron splice junctions
  • (hexa)nucleotide frequencies
  • Signatures of non-exons (repeats, Alus, etc.)
  • Compatibility with surrounding sequence / other
    exons (e.g. a consistent reading frame)

8
CpG Islands
  • CpG islands are regions of sequence that have a
    high proportion of CG dinucleotide pairs (p is
    the phosphodiester bond linking them)
  • CpG islands are present in the promoter and
    exonic regions of approximately 40% of mammalian
    genes
  • Other regions of the mammalian genome contain few
    CpG dinucleotides, and these are largely
    methylated
  • Definition (computed in the sketch below):
    sequences of >500 bp with
  • GC content ≥ 55%
  • Observed(CpG)/Expected(CpG) ≥ 0.65
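
The two statistics in the definition can be computed
directly; a small sketch (assuming the usual
expected-CpG formula, count(C) * count(G) / length):

def cpg_stats(seq: str) -> tuple[float, float]:
    """Return (GC fraction, observed/expected CpG) for a window."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    observed = seq.count("CG")        # observed CpG dinucleotides
    expected = c * g / n              # expected count if C and G independent
    gc_frac = (c + g) / n
    return gc_frac, (observed / expected if expected else 0.0)

def is_cpg_island(seq: str) -> bool:
    """Apply the definition above: >500 bp, GC >= 55%, obs/exp >= 0.65."""
    gc, ratio = cpg_stats(seq)
    return len(seq) > 500 and gc >= 0.55 and ratio >= 0.65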

9
Splice junctions
  • Most Eukaryotic introns have a consensus splice
    signal: GU at the beginning (donor), AG at the
    end (acceptor); GT/AG in the DNA sense strand.
  • Variation does occur in the splice sites
  • Many AGs and GTs are not splice sites (see the
    naive scan sketched below).
  • Database of experimentally validated human splice
    sites: http://www.ebi.ac.uk/thanaraj/splice.html
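
To see why so many GTs and AGs are not splice sites,
here is a naive scan that pairs every GT with every
downstream AG (illustrative only; real predictors
score full donor/acceptor motifs):

import re

def candidate_introns(seq: str, min_len: int = 60) -> list[tuple[int, int]]:
    """All (donor, acceptor) positions where a GT...AG intron could sit."""
    seq = seq.upper()
    donors = [m.start() for m in re.finditer("GT", seq)]
    acceptors = [m.start() for m in re.finditer("AG", seq)]
    # The candidate count grows roughly quadratically with sequence
    # length -- almost all of these pairs are false positives.
    return [(d, a) for d in donors for a in acceptors if a - d >= min_len]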

10
Hexanucleotide frequencies
  • Amino acid distributions are biased, e.g. p(A) >
    p(C)
  • Pairwise distributions are also biased, e.g.
    p(AT)/(p(A)p(T)) > p(AC)/(p(A)p(C))
  • Nucleotides that code for preferred amino acids
    (and AA pairs) occur more frequently in coding
    regions than in non-coding regions.
  • Codon biases (per amino acid)
  • Hexanucleotide distributions that reflect those
    biases indicate coding regions (a log-odds scorer
    is sketched below).
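
One way to turn hexamer frequencies into a score is a
log-odds sum over a window; a sketch, assuming
frequency tables estimated from known coding and
non-coding training sequences:

import math
from collections import Counter

def hexamer_freqs(training_seqs: list[str]) -> dict[str, float]:
    """Relative frequency of every overlapping 6-mer in a training set."""
    counts = Counter()
    for s in training_seqs:
        s = s.upper()
        for i in range(len(s) - 5):
            counts[s[i:i + 6]] += 1
    total = sum(counts.values())
    return {hexamer: n / total for hexamer, n in counts.items()}

def hexamer_score(seq: str, coding: dict, noncoding: dict,
                  floor: float = 1e-6) -> float:
    """Sum of log(p_coding / p_noncoding) over the window's hexamers."""
    seq = seq.upper()
    score = 0.0
    for i in range(len(seq) - 5):
        h = seq[i:i + 6]
        score += math.log(coding.get(h, floor) / noncoding.get(h, floor))
    return score   # > 0 suggests coding-like composition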

11
Issues in 6mer frequency
  • Sliding window across all 6 reading frames.
  • Significance of a score?
  • In order to get good statistics on hexamer
    frequencies in an ORF, it has to be long
  • Amino acid dimer (and nucleotide hexamer)
    frequencies vary by organism.
  • Using frequencies from one organism (or a
    consensus) for another gives a noisier signal yet.

12
General challenges
  • Short exons are hard to find, but not uncommon.
    The shortest human exon is 3 bp!
  • First and last exons are particularly difficult
  • No bounding intron, so no splice site signal
    (although start and end codons are related
    signals)
  • Generally contain non-coding sequence as well as
    coding sequence, so hexamer signals are diluted.
  • Alternative splicing means that there are
    multiple true solutions for some genes.

13
Probabilistic framework
  • We can express all of these weak signals as
    probabilities
  • P(exon | length)
  • P(exon | hexamer composition)
  • P(exon | CpG content)
  • P(exon | adjacent splice signals)
  • etc., etc.
  • How to combine them?
  • Need to know the dependency structure!
  • Empirical versus theoretical approaches (an
    independence-assuming combination is sketched
    below)
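
The simplest (and usually wrong) dependency structure
is full independence, which reduces combination to a
sum of log-likelihood ratios; a sketch with
hypothetical signal names and likelihoods:

import math

def combined_log_odds(signals: dict[str, tuple[float, float]]) -> float:
    """signals: name -> (P(signal | exon), P(signal | not exon)).

    Assuming independence (and equal priors), the combined log odds
    is the sum of the per-signal log-likelihood ratios.
    """
    return sum(math.log(p_exon / p_bg) for p_exon, p_bg in signals.values())

score = combined_log_odds({
    "length": (0.30, 0.10),          # hypothetical likelihoods
    "hexamers": (0.60, 0.20),
    "cpg": (0.40, 0.30),
    "splice_signals": (0.50, 0.10),
})
print(score > 0)   # True: the evidence favours the exon hypothesis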

14
Inductive Learning
  • The inference of general rules from a set of
    examples: a classic problem in CS, statistics...
  • Training examples
  • Must be representative (random is best) and
    labeled
  • Need an adequate number
  • Sometimes benefit from positive and negative
    instances
  • Representation (what aspects of examples?)
  • Kind of rule to be induced (e.g. linear)
  • Algorithm for induction...

15
Representation
  • The most important aspect of inductive learning
  • Most common: a fixed-length feature vector.
  • Feature: an observable value related to the task
  • Fixed-length vector: a particular list of
    features which has the same meaning from example
    to example
  • Some feature sets (like sequences) have variable
    length or can have variable meaning
  • Can translate via a sliding window
  • Need all the relevant features and not too many
    irrelevant ones (for the amount of data)

16
HEXON/FEX
  • An early, simple approach: linear discriminant
    analysis (LDA).
  • Training set (exons and non-exons)
  • Set of features (for internal exons)
  • donor/acceptor site recognizers
  • octanucleotide preferences for the coding region
  • octanucleotide preferences for intron interiors
    on either side
  • Rule to be induced: a linear combination of
    features.
  • Moderately accurate

17
Linear Discriminant Analysis
  • Imagine we set a pseudovariable Y to be 1 when
    an example is an exon, and -1 otherwise
  • Let the F features in the vector be Xi, i ∈ F
  • Do multiple regression over the examples to
    calculate the least squares fit of the
    coefficients a and ci to Y = a + Σi ciXi
  • For new examples to be tested, if Y > 0, then
    call it an exon (see the sketch below).
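
A sketch of this regression-as-discriminant idea
(assumed data shapes; not the HEXON/FEX code itself):

import numpy as np

def fit_linear_discriminant(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Least-squares fit of Y = a + sum_i c_i X_i.

    X: (n_examples, n_features); y: +1 for exons, -1 otherwise.
    Returns the coefficient vector [a, c_1, ..., c_F].
    """
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])   # intercept column for a
    coeffs, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coeffs

def predict_exon(coeffs: np.ndarray, x: np.ndarray) -> bool:
    """Call an example an exon when the fitted Y is positive."""
    return float(coeffs[0] + coeffs[1:] @ x) > 0.0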


18
Linear discrimination picture
19
HMM for exons
  • Large, internal exons only
  • Training sequences
  • Initial training on 100 nucleotides of intron
  • Followed by 100-200 (avg 142) nucleotides of exon
  • Followed by 100 nucleotides of intron.
  • Found a compositional periodicity of size 10,
    apparently due to positioning on the nucleosome.

20
Exon HMM periodicity
  • Donor and acceptor are clear, as is the G-rich
    region near the donor

A 10-state circular HMM's probability is nearly as
high as the linear model's!
21
GRAIL
  • A similar approach to FEX, but with a different
    feature list, and a neural network to combine the
    features.
  • First, uses about 30 manually derived rules to
    exclude impossible candidates (95%!)
  • Then a neural network with these features
  • Coding probability (from hexamers)
  • GC composition
  • length
  • splice site signal strength measure
  • intron score for adjacent regions

22
Neural Networks
  • Method for inducing non-linear predictive
    combinations from examples
  • Metaphor to real neurons.

[Figure: a small network of nodes with example
activation values and arc weights]
23
How do NN's work?
  • Each node has a value, and each arc a weight
  • Values of input nodes are set by each example
  • Non-input nodes are the weighted sum of their
    inputs put through a "squashing" function

n4 = f(n1w1 + n2w2 + n3w3)

[Figure: node n4 receiving inputs n1, n2, n3 over
weights w1, w2, w3]
24
Using a NN for prediction
  • Each layer gets inputs only from the layer below
    it, and gives output only to the layer above it:
    a feed-forward topology.
  • Given the weights and input values, start with
    the inputs and work forward to an output.
  • The output node can be normalized to 0-1 for a
    probability
  • The input to each node is a linear combination,
    but its output is non-linear.
  • A combination of non-linear combinations can
    approximate a very large class of functions (a
    forward pass is sketched below).
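
A sketch of the forward pass with a logistic
squashing function (layer sizes and weights here are
arbitrary, not GRAIL's network):

import math
import random

def squash(x: float) -> float:
    """Logistic squashing function; output lies in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def layer_forward(values: list[float],
                  weights: list[list[float]]) -> list[float]:
    """Each node: squash(weighted sum of the layer below)."""
    return [squash(sum(v * w for v, w in zip(values, ws))) for ws in weights]

def feed_forward(inputs: list[float],
                 layers: list[list[list[float]]]) -> float:
    """Work forward from the inputs; the output can act as P(exon)."""
    values = inputs
    for weights in layers:
        values = layer_forward(values, weights)
    return values[0]

# Hypothetical 5-input, 3-hidden, 1-output network with random weights.
random.seed(0)
hidden = [[random.uniform(-1, 1) for _ in range(5)] for _ in range(3)]
output = [[random.uniform(-1, 1) for _ in range(3)]]
print(feed_forward([0.8, 0.3, 0.5, 0.1, 0.9], [hidden, output]))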

25
NN discrimination picture
[Figure: scatter of exon (e) and intron (i) examples
plotted by Feature 1 vs. Feature 2, separated by a
curved decision boundary]
26
Where do weights come from?
  • Have training examples with known outputs.
  • Use the "backpropagation" algorithm
  • Start with small random weights
  • For each example, use feed forward to calculate
    the predicted outcome.
  • Calculate the error (difference between predicted
    and actual outcome) and change the weights to
    reduce the amount of error.
  • Repeat.

27
How to change the weights
  • Calculate an error term for each node
  • For the output node, the error term e = target
    (t) - output (o)
  • Let the derivative of the squashing function be
    f'. Define the error signal to be f'(sum of
    inputs) * e
  • For non-output nodes, the error term is the sum
    of all of the error signals in the layer above,
    multiplied by the weights connecting them to the
    node.
  • Δwi = error signal * ni * L, where ni is the
    input on that weight and L is the learning rate
    (see the sketch below)
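
A sketch of one delta-rule step for a single logistic
output node (using the fact that the logistic's
derivative is f'(x) = f(x)(1 - f(x)); the names are
illustrative):

import math

def squash(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def update_output_weights(inputs: list[float], weights: list[float],
                          target: float, rate: float) -> list[float]:
    """One step of: e = t - o; signal = f'(sum) * e; dw_i = signal * n_i * L."""
    total = sum(n * w for n, w in zip(inputs, weights))
    o = squash(total)
    e = target - o                 # error term for the output node
    signal = o * (1.0 - o) * e     # f'(sum) * e, via the logistic derivative
    return [w + rate * signal * n  # dw_i = error signal * n_i * L
            for n, w in zip(inputs, weights)]

print(update_output_weights([0.5, 0.9, 0.1], [0.1, -0.2, 0.3],
                            target=1.0, rate=0.5))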

28
Weight change picture
[Figure: inputs n1, n2, n3 feeding a node over
weights w1, w2, w3]

Δw = f'(sum) * e * ni * L
29
Neural Network Training Summary
  • Calculating the partial derivative of the total
    error with respect to each weight, and moving the
    weight a little bit in the direction that reduces
    error
  • A minimization in error space.
  • Generally, convergence to a local minimum in
    training error space gives good performance.
  • But...

30
Overfitting!
  • If there are too many weights for the number of
    training examples, or training goes on too long,
    generalization performance will be poor.

31
Overfitting picture
[Figure: the same exon/intron scatter over Feature 1
vs. Feature 2, with an overly convoluted boundary
threading around individual training points]
32
More on Neural Networks
  • Neural Network FAQ (programs, tutorials, etc.):
    ftp://ftp.sas.com/pub/neural/FAQ.html
  • Different squashing functions give different
    generalization, e.g. radial basis functions.
  • Can add a "momentum" term to the minimization to
    avoid oscillations
  • Other minimization techniques (e.g. conjugate
    gradient descent) can give faster / better
    convergence and eliminate the rate parameter

33
MZEF
  • Michael Zhang's Exon Finder
  • Uses a more nuanced view of exons (12 different
    categories)
  • Fancier models of the end regions (e.g. splice
    sites)
  • Quadratic Discriminant Analysis (QDA; sketched
    below)
  • Similar to LDA, but a more complex function
    allows for curved discrimination surfaces
  • Better accuracy than either LDA or NNs in a
    third-party comparison.
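
A sketch of two-class QDA in its standard Gaussian
form (one Gaussian per class; this is the generic
method, not MZEF's implementation):

import numpy as np

class QDA:
    """Fit one Gaussian per class; classify by the larger log-density."""

    def fit(self, X: np.ndarray, y: np.ndarray) -> "QDA":
        # X: (n_examples, n_features); y: 1 for exons, 0 otherwise.
        self.params = {}
        for label in (0, 1):
            Xc = X[y == label]
            self.params[label] = (
                Xc.mean(axis=0),
                np.cov(Xc, rowvar=False),   # class-specific covariance
                np.log(len(Xc) / len(X)),   # log prior
            )
        return self

    def _log_density(self, x, mean, cov, log_prior):
        d = x - mean
        # Quadratic in x: this is what curves the decision surface.
        return log_prior - 0.5 * (d @ np.linalg.inv(cov) @ d
                                  + np.log(np.linalg.det(cov)))

    def predict(self, x: np.ndarray) -> int:
        scores = {c: self._log_density(x, *p) for c, p in self.params.items()}
        return max(scores, key=scores.get)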

34
QDA picture
[Figure: the exon/intron scatter over Feature 1 vs.
Feature 2, separated by a smooth quadratic decision
boundary]
35
Readings
  • Mount, ch. 8, pp. 337-357. Most of this section
    describes Prokaryotic gene identification, which
    is a much easier task. Note that an HMM achieves
    very good accuracy at it.
  • Michael Zhang, "Computational Prediction of
    Eukaryotic Protein-Coding Genes." Nature Reviews
    Genetics 3: 698-709 (2002).