Title: I. Gene Finding

1
I. Gene Finding

I.1. Introduction
I.2. Approaches to (Protein-Coding) Gene Finding
I.2.1 Signature of Exons and non-Exons
I.2.2 Approaches
I.3. Other Topics
I.3.1. Alternative Splicing
I.3.2. RNA Gene Prediction
I.3.3. Other Gene Components Prediction
I.3.4. Gene Ontology Prediction
--------------------------------------------------
Readings: Gene Annotation Prediction and Testing

2
I.1. Introduction: The importance of gene finding

Complete sequencing of the human genome. SO WHAT??!!!
Given the sequence of a genome, we would like to be able to identify:
Genes
Exon boundaries and splice sites
Beginning and end of translation
Alternative splicings
Regulatory elements (e.g. promoters)
Subsequently: gene regulatory networks, signaling networks, metabolic networks.
The only certain way to do this is experimentally, but computational methods can achieve reasonable accuracy quickly and help direct experimental approaches.

3
I.1. Introduction: Eukaryotic gene structure
4
I.1 Introduction: NCBI RefSeq Sequences

Five years ago: an estimated > 100,000 human genes. Now: 30,000, then 25,000.
NCBI RefSeq
RefSeq Code: http://www.ncbi.nlm.nih.gov/RefSeq/key.html#status
Only 25,000? Well, isoforms.

????, not very reliable
Probably ok
Good
5
I.1 Introduction: Gene Prediction

There is no (yet known) perfect method for finding genes. All approaches rely on combining various weak signals.
Find elements of a gene:
coding sequences (exons): finding splice sites
start signals: Translation Initiation Site (TIS), Transcription Start Site (TSS)
promoter and other transcription factor binding sites
poly-A tails and downstream signals
Assemble into a consistent gene model.

6
I.1 Introduction: General challenges

Short exons are hard to find, but not uncommon. The shortest human exon is 3 bp!
First and last exons are particularly difficult:
No bounding intron on one side, so no splice site signal (although start and stop codons are related signals).
Generally contain non-coding sequence as well as coding sequence, so hexamer signals are diluted.
Alternative splicing (isoforms) means that there are multiple true solutions for some genes.
In the human genome, introns a few thousand bp long are very common, making the problem extremely difficult.

7
I.1. Introduction: The importance of splice site prediction
Perfect prediction (ignoring isoforms for now)
False negative: missing out a gene
False positive: computationally expensive
8
I.2.1 Signature: Signatures of exons and non-exons

Signatures of splice sites (exons):
CpG islands
intron splice junctions
(hexa)nucleotide frequencies
Signatures of non-exons (repeats, ALUs, etc.)
Compatibility with surrounding sequences and other exons (e.g. consistent reading frame)

9
I.2.1 Signature: CpG Islands

CpG islands are regions of sequence that have a high proportion of CG dinucleotides (the p is the phosphodiester bond linking them).
CpG islands are present in the promoter and exonic regions of approximately 40% of mammalian genes.
Other regions of the mammalian genome contain few CpG dinucleotides, and these are largely methylated.
Methylation disables transcription factor binding.
Definition: sequences of > 500 bp with
GC ≥ 55%
Observed(CpG)/Expected(CpG) ≥ 0.65
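
The definition above can be checked mechanically. The sketch below is only an illustration: the function names are made up here, and it assumes the common convention that the expected CpG count is (#C x #G) / length, which the slide does not spell out.

```python
def cpg_stats(seq):
    """Return (GC fraction, observed/expected CpG ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_fraction = (c + g) / n if n else 0.0
    # Assumed convention: expected CpG count if C and G occurred independently
    expected = (c * g) / n if n else 0.0
    ratio = cpg / expected if expected else 0.0
    return gc_fraction, ratio


def looks_like_cpg_island(seq, min_len=500, min_gc=0.55, min_ratio=0.65):
    """Apply the three thresholds from the slide to one candidate region."""
    gc_fraction, ratio = cpg_stats(seq)
    return len(seq) > min_len and gc_fraction >= min_gc and ratio >= min_ratio
```

In practice the test would be applied to a window slid along the genome rather than to one pre-cut region.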

10
I.2.1 Signature: Splice junctions

Most eukaryotic introns have a consensus splice signal: GU at the beginning (donor), AG at the end (acceptor).
Variation does occur in the splice sites.
Many AGs and GTs are not splice sites.
Recently: exon enhancers and insulators.

11
I.2.1 Signature: Hexanucleotide frequencies

Amino acid distributions are biased, e.g. p(A) > p(C).
Pairwise distributions are also biased, e.g. p(AT)/p(A)p(T) > p(AC)/p(A)p(C).
Nucleotides that code for preferred amino acids (and AA pairs) occur more frequently in coding regions than in non-coding regions.
Codon biases (per amino acid).
Hexanucleotide distributions that reflect those biases indicate coding regions.
Statistics have to be gathered over a large number of sequences.
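
One way to turn hexamer bias into a score is a log-odds table: how much more likely a hexamer is in known coding sequence than in known non-coding sequence. The sketch below is an assumption-laden illustration (function names, the pseudocount, and the use of overlapping hexamers are all choices made here, not taken from the slides).

```python
import math
from collections import Counter


def hexamer_counts(seqs):
    """Count all overlapping 6-mers in a collection of DNA strings."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - 5):
            counts[s[i:i + 6]] += 1
    return counts


def log_odds_scorer(coding_seqs, noncoding_seqs, pseudocount=1.0):
    """Return a function hexamer -> log(P(hexamer|coding) / P(hexamer|non-coding))."""
    coding = hexamer_counts(coding_seqs)
    noncoding = hexamer_counts(noncoding_seqs)
    n_hex = 4 ** 6  # number of possible hexamers, used for smoothing
    c_total = sum(coding.values()) + pseudocount * n_hex
    n_total = sum(noncoding.values()) + pseudocount * n_hex

    def score(hexamer):
        p_c = (coding[hexamer] + pseudocount) / c_total
        p_n = (noncoding[hexamer] + pseudocount) / n_total
        return math.log(p_c / p_n)

    return score


def window_score(seq, score):
    """Sum the log-odds of every hexamer in a window; > 0 leans coding."""
    seq = seq.upper()
    return sum(score(seq[i:i + 6]) for i in range(len(seq) - 5))
```

With real data the two training sets would be large collections of annotated exons and introns from the same organism.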

12
I.2.1 Signature: Issues in 6mer frequency

Sliding window across all 6 reading frames.
Significance of a score?
In order to get good statistics on hexamer frequencies in an ORF, it has to be long.
Amino acid dimer (and nucleotide hexamer) frequencies vary by organism.
Using frequencies from one organism (or a consensus) for another gives an even noisier signal.

13
I.2.2. Splice Site Prediction: Inductive Learning

The inference of general rules from a set of examples: a classic problem in CS, statistics...
Training examples
Must be representative (random is best) and labeled
Need an adequate number
Sometimes benefit from positive and negative instances
Representation (what aspects of examples?)
Kind of rule to be induced (e.g. linear)
Algorithms for induction
Decision Tree
Support Vector Machine
Artificial Neural Networks
Bayesian Learning (Bayesian Network)
Instance-Based Learning
The key is __________ not so much in __________

14
I.2.2. Splice Site Prediction: Representation

Most important aspect of inductive learning.
Most common: fixed-length feature vector.
Feature: an observable value related to the task.
Fixed-length vector: a particular list of features which has the same meaning from example to example.
Some feature sets (like sequences) have variable length or can have variable meaning.
Can translate by sliding window
Position Weight Matrix (PWM)
Non-positional consensus sequences
Need all the relevant features and not too many irrelevant ones (for the amount of data).
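
As an illustration of the sliding-window and PWM ideas, the sketch below builds a log-odds position weight matrix from aligned example sites and scores every window of a sequence. The function names, the pseudocount, and the uniform 0.25 background are assumptions made for this example.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds PWM from equal-length aligned sites."""
    pwm = []
    for pos in range(len(sites[0])):
        column = [s[pos].upper() for s in sites]
        total = len(column) + pseudocount * len(BASES)
        pwm.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm


def score_window(pwm, window):
    """Sum per-position scores; unknown characters (e.g. N) contribute 0."""
    return sum(col.get(base, 0.0) for col, base in zip(pwm, window.upper()))


def scan(pwm, seq):
    """Slide the PWM along seq and return (offset, score) for each window."""
    w = len(pwm)
    return [(i, score_window(pwm, seq[i:i + w])) for i in range(len(seq) - w + 1)]
```

Each window score (or the per-position base identities themselves) can then serve as a fixed-length feature for a classifier.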

15
I.2.2. Splice Site Prediction: High Imbalanced Ratio

The difficulty: high imbalanced ratio.
Assume that all nucleotides have an equal chance of appearing. Then, roughly speaking, the number of expected GTs is > 3,000,000,000/16.
Given that human has about 25,000 genes and each gene on average has 10 exons, the number of actual donor sites is about 250,000.
Thus, the imbalanced ratio is roughly
16 × 250,000 : 3,000,000,000 = 4 : 3,000 = 1 : 750!
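
The back-of-the-envelope estimate can be reproduced in a few lines. The variable names are illustrative; the numbers are the ones on the slide, and the calculation keeps the slide's simplifying assumption of uniformly random nucleotides on one strand.

```python
from fractions import Fraction

genome_size = 3_000_000_000       # human genome, in bp
expected_gt = genome_size // 16   # each dinucleotide expected 1/16 of the time
genes = 25_000
exons_per_gene = 10
donor_sites = genes * exons_per_gene  # about one donor site per exon

# Ratio of true donor sites to candidate GT dinucleotides
ratio = Fraction(donor_sites, expected_gt)
print(ratio)  # 1/750
```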

16
I.2.2. Splice Site Prediction: Approaches to dealing with a high imbalanced ratio

In bioinformatics:
Under-sample the majority class.
Splicing: predict splicing with known gene boundaries. Imbalanced ratio reduced from 1:750 to 1:30, assuming that coding sequences occupy 2% of our genome.
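
Random under-sampling of the majority class is simple to sketch. The helper below is hypothetical (name, signature, and the `ratio` parameter are choices made here); it keeps every positive example and a random subset of the negatives.

```python
import random


def undersample(positives, negatives, ratio=1.0, seed=0):
    """Keep all positives and about ratio * len(positives) random negatives."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    k = min(len(negatives), int(round(ratio * len(positives))))
    return list(positives), rng.sample(list(negatives), k)
```

For example, `undersample(donors, decoys, ratio=3)` would return a training set with three decoy GT sites per true donor site, where `donors` and `decoys` are whatever example lists are at hand.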

17
I.2.2. Splice Site Prediction: Better ways of dealing with imbalanced data

The basic idea:
First cluster the majority class; hopefully there is a large collection of clusters that are almost purely majority.
If an unknown instance falls into one of these clusters, then we say that the instance is a negative instance.
18
I.2.2 Splice Site Prediction: Performance Measure for Imbalanced Data

If the imbalanced ratio is high, say 1:1000, then simply always saying negative (assuming that the majority is negative) gives an accuracy of 99.9%!
More meaningful measures are recall and precision.
Recall: proportion of positive (minority) instances that are predicted (i.e. recalled) as positive.
Precision: proportion of instances predicted as positive that are actually positive.
But recall and precision trade off against each other, so use the F-measure, the harmonic mean of recall and precision, as a compromise measure.
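
These measures are easy to compute from true and predicted labels. The helper names below are illustrative; labels are 1 for positive and 0 for negative, and undefined ratios are reported as 0.0 by convention in this sketch.

```python
def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall_f(y_true, y_pred):
    """Return (precision, recall, F-measure) for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f


# The "always say negative" predictor on 1:1000 data
y_true = [1] + [0] * 1000
y_pred = [0] * 1001
print(accuracy(y_true, y_pred))            # about 0.999
print(precision_recall_f(y_true, y_pred))  # (0.0, 0.0, 0.0)
```

The trivial predictor looks excellent by accuracy and useless by recall and F-measure, which is the point of the slide.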

19
I.2.2. Splice Site Prediction: Kihoon's Splice Site Result
20
I.2.3 Gene Assembly

There are three approaches, plus combinations:
Homology-Based
Exon prediction-Based
HMM Approach
Combining other predictors

21
I.2.3. Gene Assembly: Inference by homology

For exon finding, we need to find matches to:
(1) ESTs (start from TSS)
(2) SwissProt or other protein databases (start from TIS)
(3) Known exons
Intronless: easy, match all multiple-of-3-length substrings from AUG to a stop codon (UAG, UAA, UGA) against ESTs.
For (2), we need to convert the predicted coding sequence to protein sequences.
For (3), certain organisms are particularly useful. E.g., Tetraodon nigroviridis (green spotted pufferfish) is a vertebrate with very few introns.
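
For the intronless case, enumerating the candidate substrings is a small exercise. The sketch below is a simplified illustration: it works on DNA (ATG and TAA/TAG/TGA instead of AUG and UAA/UAG/UGA), scans only the forward strand, and reports only the first ATG of each open reading frame; the function name and `min_codons` parameter are made up here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=1):
    """Return (start, end, sequence) for ATG...stop ORFs in the 3 forward frames.

    `end` is exclusive and includes the stop codon; `min_codons` counts
    codons before the stop codon (including the ATG).
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None
    return orfs
```

Each reported sequence could then be compared against ESTs, or translated and compared against a protein database.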

May need to backtrack
TSS
22
I.2.3. Gene Assembly: ExoFish

Idea: coding regions are more conserved than non-coding regions.
Uses sequences of Tetraodon nigroviridis:
Genome is 8 times smaller than human
Almost no repetitive sequence
Relatively few introns
Most human genes have a homolog
70% complete sequence

23
I.2.3. Gene Assembly: EST_Genome

ESTs (Expressed Sequence Tags) come from random sequencing of fragments of mRNAs. They are not complete coding sequences, but are all coding sequences.
dbEST and other large databases exist (from varying organisms). There are over 6 million ESTs.
Matching ESTs to genomic sequence gives important clues about coding sequences.
Needs to handle introns in ESTs gracefully.
AAT, GeneSeqer, SIM4, Spidey: match ESTs
Procrustes, GeneWise, ORFgene, ALN: match proteins
The matches serve as evidence.

24
I.2.3. Gene Assembly: Prediction-Based Method

Instead of trying to match ESTs, build an exon predictor.
Splice site predictors generate candidate exons.
Need to determine the real exons among the candidate exons.
Train a classifier: convert exons in known genes and non-exons (stretches that have AG in front and GT behind but are not exons) to training examples and train a classifier. Each exon is assigned a confidence score.
Use dynamic programming to assemble the exons into genes.
Examples: GeneId, GeneView, GAP3, FGENE, DAGGER.
Advantage
Disadvantage
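
The dynamic-programming assembly step can be sketched as weighted interval scheduling: pick a set of non-overlapping candidate exons with the highest total confidence score. This is a deliberately simplified stand-in, not the algorithm of any of the tools named above; it ignores reading-frame compatibility and intron length, and the function name is illustrative.

```python
import bisect


def assemble(exons):
    """exons: list of (start, end, score) with end exclusive.

    Returns (best_total_score, chosen_exons) over non-overlapping subsets.
    """
    exons = sorted(exons, key=lambda e: e[1])  # order by end coordinate
    ends = [e[1] for e in exons]
    best = [(0.0, [])]  # best[i] = optimum using only the first i exons
    for i, (start, end, score) in enumerate(exons):
        # j = how many of the earlier exons end at or before this start
        j = bisect.bisect_right(ends, start, 0, i)
        take = (best[j][0] + score, best[j][1] + [(start, end, score)])
        skip = best[i]
        best.append(take if take[0] > skip[0] else skip)
    return best[-1]
```

A realistic assembler would add constraints to the recurrence, for example that consecutive exons keep a consistent reading frame.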

25
I.2.3. Gene Assembly: Hidden Markov Model

Instead of predicting splice sites and then trying to match ESTs or using the predictor for exon prediction, generate splice sites and predict exons together using an HMM.
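
To make the HMM idea concrete, here is a toy two-state model (E for exon, I for intron) decoded with the Viterbi algorithm. Everything about the model is invented for illustration: real gene-finding HMMs have many more states (splice sites, frames, UTRs), and the probabilities below are made-up numbers, not trained values.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state path for obs (log-space Viterbi)."""
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = []
    for o in obs[1:]:
        row, ptr = {}, {}
        for s in states:
            # Best predecessor state for landing in s at this position
            prev = max(states, key=lambda r: V[-1][r] + math.log(trans_p[r][s]))
            row[s] = V[-1][prev] + math.log(trans_p[prev][s]) + math.log(emit_p[s][o])
            ptr[s] = prev
        V.append(row)
        back.append(ptr)
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


# Hypothetical toy parameters: exons GC-rich, introns AT-rich
STATES = ["E", "I"]
START = {"E": 0.5, "I": 0.5}
TRANS = {"E": {"E": 0.9, "I": 0.1}, "I": {"E": 0.1, "I": 0.9}}
EMIT = {
    "E": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "I": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}

print("".join(viterbi("GCGCGCATATAT", STATES, START, TRANS, EMIT)))  # EEEEEEIIIIII
```

The exon/intron boundary falls out of the decoding itself, which is what "predict together" means here: there is no separate splice-site step.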

26
I.2.3. Gene Assembly: Putting together multiple predictions

None of these methods are ideal, so...
Best results come from agreement among multiple approaches.
Computational integration: many gene finders now include EST or homology information.
E.g. FGenesH, GenomeScan
Visualization environments for viewing multiple predictions:
Genome browsers (UCSC annotation tracks!), Genotator

27
I.2.3. Gene Assembly: GeneComber

Simple, open source Perl/MySQL program; finds agreement between HMMGene and GenScan.
Accuracy is higher when they agree.
Common strategy for combining predictions.
Can also use fancier combining principles, e.g. NNs.
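
The simplest form of agreement is exact intersection of predicted exons. The sketch below illustrates that strategy only; it does not reproduce GeneComber's actual logic, and the function names and the (start, end) exon representation are assumptions of this example.

```python
def consensus_exons(pred_a, pred_b):
    """Exons (start, end) predicted identically by both programs."""
    return sorted(set(pred_a) & set(pred_b))


def disputed_exons(pred_a, pred_b):
    """Exons predicted by exactly one of the two programs."""
    return sorted(set(pred_a) ^ set(pred_b))
```

For instance, with `a = [(100, 200), (300, 420)]` and `b = [(100, 200), (310, 420)]`, the consensus is `[(100, 200)]` and the other two exons are left for closer inspection.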

28
I.3.1. Alternative Splicing

An estimated 50% of human genes have alternative splicing.
Drosophila Dscam (Down syndrome cell adhesion molecule): > 38,000 isoforms! Present in human on 21q22.
Recent results have demonstrated the importance of alternative splicing (e.g. exon skipping, use of partial exons), particularly in mammalian genomes.
Predictions of alternatively spliced gene products are valuable.
Knowledge of alternative splice possibilities can be used to improve gene prediction.

29
I.3.1. Alternative Splicing: EuGÈNE

Takes alignments of alternatively spliced
transcripts, and enumerates all possible paths
through exons that could generate observations
Uses dynamic programming to find a minimal set

30
I.3.1. Alternative Splicing: HMM path distributions

Cawley & Pachter, 05: used the sampled distribution of paths through a HMMGene to estimate alternative splicing likelihoods.
Contrast with the Viterbi assumption!

31
I.3.1. Alternative Splicing: Predictions of unobserved AS

Chris Burge's UNCOVER: pair HMM (states cover two sequences, saying either conserved or not).
Looks at one human/mouse orthologous intron to estimate whether it contains a splice.
Alternative conserved exons detected with high accuracy (much better than BLASTing).
100s of novel predictions verified with PCR.

32
I.3.2 RNA gene prediction

Not all gene products are proteins! RNA genes
are significant, too.
Hexamer frequency, etc. obviously not going to
work
Generally, different RNA genes have quite
different characteristics, so there are different
approaches to each family, e.g.
Bacterial RNA genes
tRNAs
miRNAs

33
I.3.2. RNA gene prediction: miRNA predictions

Recent discoveries show the growing importance of microRNAs in expression regulation.
Chris Burge (again!) with MiRScan: uses SVMs.
Particularly tricky, since there is no known negative set.
Uses a set that is probably mostly negative, and resampling to estimate a threshold that allows some negative examples to become positive.
Seeded with known positives for testing.

34
I.3.3. TIS-predictor: Previous result

Almost always, starts with ATG.
Some papers say "It begins with ATG, so it is trivial."
Number of ATGs vs number of genes: 1:3500.
Most results are based on the Pedersen and Nielsen data set, with an imbalanced ratio of 1:3.

35
I.3.3. Promoter Predictor: Previous Result

Promoter / TSS prediction
TSS prediction is essentially promoter prediction: RNA polymerase needs to bind to the promoter, which is close to the TSS.
TSS is more difficult since there is no consensus sequence like ATG. Imbalanced ratio of about 25,000 : 3,000,000,000 = 1 : 120,000.
Other consensus sequences have been found recently: "Clustering of DNA sequences in human promoters".
Typically place a window.
Recently, the window has been reduced to -50 to +50, with HMM. Result: recall and precision of about 20-30%.

4000
TSS
36
I.3.3. Promoter Prediction

Some popular classes of consensus sequences

TBP is a subunit of TFIID, which in turn is part of the RNA polymerase II preinitiation complex.
37
I.3.3. TIS/TSS/Promoter Predictors: Some ideas

Kozak sequence: (GCC)RCCATGG, where R is a purine (A or G). Possibly other sequences.
Kozak's result is fairly old (from the 80s); with more genes known and better computational techniques, there may be other sequences.
Proposed project (find TIS, TSS, promoter):
Use the repeat database (on /mnt/nas1) to focus on the areas that may contain genes. Perhaps start with Chromosome 22 first.
Try to find non-positional consensus sequences as features. (Kihoon is developing a generic program for it.)
Possibility of linking the two predictors: if there is no predicted TIS nearby, then a promoter cannot exist. Need to compute statistics on how far the TIS is from the TSS (i.e. length of the 5' UTR).
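
As a starting point for the TIS part of the project, the Kozak consensus can be located with a regular expression. This sketch is only an illustration: the function name is made up, it requires the core RCCATGG exactly (no mismatches), and it reports the optional upstream GCC as a flag.

```python
import re

# Core Kozak context R CC ATG G; lookahead so overlapping hits are all found
KOZAK_CORE = re.compile(r"(?=[AG]CC(ATG)G)")


def kozak_sites(dna):
    """Return (atg_position, has_upstream_gcc) for each exact Kozak match."""
    dna = dna.upper()
    hits = []
    for m in KOZAK_CORE.finditer(dna):
        upstream = dna[max(0, m.start() - 3):m.start()]
        hits.append((m.start(1), upstream == "GCC"))
    return hits
```

Since most real start sites match the consensus only partially, a PWM or a learned classifier over the same window is the more likely end product; the exact match is just the baseline.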

38
I.3.3. TIS/TSS/Promoter Predictors: Some other generalizations

Similar ideas can be applied to predict the 3' UTR. Somehow polyA tail prediction receives less attention, but recently, from talking to UTHSCSA people, it seems to be interesting.
For promoter prediction: perhaps predict the subclass, but need to do some background work to see if the subclass is interesting.
NOTE: you have to give me the background! Try to meet me and read at the very least 3-4 of the latest papers! October 18! Which is in 2 weeks' time! Need to meet me twice, no kidding!

39
I.3.4. Beyond Gene Finding: Gene Ontology Predictions

Simply finding genes is not good enough. Need to know their:
Function: what do they do?
Localization: where do they do it?
Process: how do they do it?
To this end, the Gene Ontology Consortium has come up with a vocabulary to describe gene product attributes.
Problems with current gene prediction:
Extremely high accuracy for a multiclass problem, > 80%. Problem with training data and homology search.
Perhaps an intermediate prediction is to predict the signal peptide and what signals it provides.

40
I.3.4 Beyond Gene Finding: Regulatory Network

Regulatory network: how genes control other genes (mainly through the production of transcription factors or subunits of transcription factors).

41
I.3.4. Beyond Gene Finding: Other Networks

Metabolic pathway: how the cell takes in raw material (sugar).
Signal transduction: how the cell receives and transmits signals.
Currently, we are only interested in partial regulatory networks that involve regulation via promoter control.
Beyond this, tissue to tissue communication. Way to go!
