Title: I. Gene Finding
I. Gene Finding
- I.1. Introduction
- I.2. Approaches to (Protein-Coding) Gene Finding
- I.2.1 Signature of Exons and non-Exons
- I.2.2 Approaches
- I.3. Other Topics
- I.3.1. Alternative Splicing
- I.3.2. RNA Gene Prediction
- I.3.3. Other Gene Components Prediction
- I.3.4. Gene Ontology Prediction
- --------------------------------------------------
- Readings: Gene Annotation: Prediction and Testing
I.1. Introduction: The importance of gene finding
- Complete sequencing of the human genome. SO WHAT??!!!
- Given the sequence of a genome, we would like to
be able to identify:
- Genes
- Exon boundaries and splice sites
- Beginning and end of translation
- Alternative splicings
- Regulatory elements (e.g. promoters)
- Subsequently, gene regulatory networks, signaling
networks, and metabolic networks.
- The only certain way to do this is experimentally,
but computational methods can achieve reasonable
accuracy quickly, and help direct experimental
approaches.
I.1. Introduction: Eukaryotic gene structure
I.1. Introduction: NCBI RefSeq Sequences
- 5 years ago, estimated > 100,000 human genes.
Now, 30,000? 25,000?
- NCBI RefSeq
- RefSeq status codes: http://www.ncbi.nlm.nih.gov/RefSeq/key.html#status
- Only 25,000? Well, isoforms.
(Figure labels: "????, not very reliable"; "Probably ok"; "Good")
I.1. Introduction: Gene Prediction
- There is no (yet known) perfect method for
finding genes. All approaches rely on combining
various weak signals together.
- Find elements of a gene:
- coding sequences (exons), by finding splice sites
- start signals: Translation Initiation Site (TIS),
Transcription Start Site (TSS)
- promoter and other transcription factor binding
sites
- poly-A tails and downstream signals
- Assemble into a consistent gene model
I.1. Introduction: General challenges
- Short exons are hard to find, but not uncommon.
Shortest human exon is 3 bp!
- First and last exons are particularly difficult:
- No bounding intron on one side, so no splice site
signal there (although start and end codons are
related signals)
- Generally contain non-coding sequence as well as
coding sequence, so hexamer signals are diluted.
- Alternative splicing (isoforms) means that there
are multiple true solutions for some genes.
- For the human genome, long introns of a few
thousand bp are very common, making the problem
extremely difficult.
I.1. Introduction: The importance of splice site
prediction
(Figure labels: Perfect prediction, ignoring isoforms for now; Missing out a gene; False negative; Computationally expensive; False positive)
I.2.1 Signature: Signatures of exons and non-exons
- Signatures of splice sites (exons)
- CpG islands
- intron splice junctions
- (hexa)nucleotide frequencies
- Signatures of non-exons (repeats, ALUs, etc.)
- Compatibility with surrounding sequences and other
exons (e.g. consistent reading frame)
I.2.1 Signature: CpG Islands
- CpG islands are regions of sequence that have a
high proportion of CG dinucleotides (the p is the
phosphodiester bond linking them)
- CpG islands are present in the promoter and
exonic regions of approximately 40% of mammalian
genes
- Other regions of the mammalian genome contain few
CpG dinucleotides and these are largely
methylated
- Methylation disables transcription factor binding
- Definition: sequences of > 500 bp with
- GC content ≥ 55%
- Observed(CpG)/Expected(CpG) ≥ 0.65
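The definition above can be checked directly on a candidate sequence. The following is a minimal sketch, assuming the expected CpG count is computed as (number of C × number of G) / length; the function names are illustrative and not taken from any particular tool.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq: str) -> float:
    """Observed(CpG) / Expected(CpG), with Expected = (#C * #G) / length."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    observed = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    expected = c * g / len(seq)
    return observed / expected


def looks_like_cpg_island(seq: str, min_len: int = 500,
                          min_gc: float = 0.55, min_oe: float = 0.65) -> bool:
    """Apply the three thresholds from the slide to one candidate region."""
    return (len(seq) > min_len
            and gc_fraction(seq) >= min_gc
            and cpg_obs_exp(seq) >= min_oe)
```

In practice the test would be applied to a sliding window along the genome rather than to one pre-cut region; that step is left out here.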
I.2.1 Signature: Splice junctions
- Most eukaryotic introns have a consensus splice
signal: GU at the beginning (donor), AG at the
end (acceptor). In the DNA sequence, GU reads as GT.
- Variation does occur in the splice sites
- Many AGs and GTs are not splice sites.
- Recently, exon enhancers and insulators
I.2.1 Signature: Hexanucleotide frequencies
- Amino acid distributions are biased, e.g. p(A) >
p(C)
- Pairwise distributions are also biased, e.g.
p(AT)/(p(A)p(T)) > p(AC)/(p(A)p(C))
- Nucleotides that code for preferred amino acids
(and AA pairs) occur more frequently in coding
regions than in non-coding regions.
- Codon biases (per amino acid)
- Hexanucleotide distributions that reflect those
biases indicate coding regions.
- Statistics have to be gathered over a large amount
of sequence.
I.2.1 Signature: Issues in 6-mer frequency
- Sliding window across all 6 reading frames.
- Significance of a score?
- In order to get good statistics on hexamer
frequencies in an ORF, it has to be long
- Amino acid dimer (and nucleotide hexamer)
frequencies vary by organism.
- Using frequencies from one organism (or a
consensus) for another gives an even noisier signal.
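One common way to turn hexamer frequencies into a coding signal is a log-odds score: compare how often each hexamer occurs in known coding sequence against non-coding sequence, then sum the scores along each of the 6 reading frames. The sketch below assumes that scheme; the pseudocount, the in-frame counting of coding hexamers, and all function names are assumptions for illustration, not the method of a specific gene finder.

```python
import math
from collections import Counter


def count_hexamers(seqs, step):
    """Count hexamers in each sequence, advancing `step` bases at a time."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(0, len(s) - 5, step):
            counts[s[i:i + 6]] += 1
    return counts


def make_scorer(coding, noncoding, pseudo=1.0):
    """Return a function giving log(p_coding / p_noncoding) for one hexamer."""
    c = count_hexamers(coding, step=3)     # in-frame hexamers of coding training data
    n = count_hexamers(noncoding, step=1)  # every hexamer of non-coding training data
    c_total = sum(c.values()) + pseudo * 4 ** 6
    n_total = sum(n.values()) + pseudo * 4 ** 6

    def score(hexamer):
        p_coding = (c[hexamer] + pseudo) / c_total
        p_noncoding = (n[hexamer] + pseudo) / n_total
        return math.log(p_coding) - math.log(p_noncoding)

    return score


def frame_score(seq, score):
    """Sum hexamer log-odds over one reading frame starting at seq[0]."""
    return sum(score(seq[i:i + 6]) for i in range(0, len(seq) - 5, 3))


def reverse_complement(seq):
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def best_frame(seq, score):
    """Score all 6 reading frames and return the (strand, offset) with the top score."""
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[(strand, offset)] = frame_score(s[offset:], score)
    return max(frames, key=frames.get)
```

The issues listed on the slide show up directly here: short windows sum over few hexamers, so their scores are noisy, and the training sets passed to `make_scorer` should come from the organism being annotated.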
I.2.2. Splice Site Prediction: Inductive Learning
- The inference of general rules from a set of
examples: a classic problem in CS, statistics...
- Training examples
- Must be representative (random is best) and
labeled
- Need an adequate number
- Sometimes benefit from positive and negative
instances
- Representation (what aspects of examples?)
- Kind of rule to be induced (e.g. linear)
- Algorithms for induction
- Decision Tree
- Support Vector Machine
- Artificial Neural Networks
- Bayesian Learning (Bayesian Network)
- Instance-Based Learning
- The key is __________ not so much in __________
I.2.2. Splice Site Prediction: Representation
- Most important aspect of inductive learning
- Most common: fixed-length feature vector.
- Feature: an observable value related to the task
- Fixed-length vector: a particular list of
features which has the same meaning from example
to example
- Some feature sets (like sequences) have variable
length or can have variable meaning
- Can translate by sliding window
- Position Weight Matrix (PWM)
- Non-positional consensus sequences
- Need all the relevant features and not too many
irrelevant ones (for the amount of data)
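A position weight matrix is one way to turn aligned, fixed-length examples into a positional scoring function that can be slid along a sequence. The sketch below assumes a log-odds PWM against a uniform background with a pseudocount of 1. The four training sites are made-up toy data, not real donor sites, and the function names are illustrative.

```python
import math

# Toy, made-up aligned donor-site examples (assumption: all the same length).
TOY_DONOR_SITES = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGA", "TAGGTGAGT"]


def build_pwm(sites, pseudo=1.0, background=0.25):
    """One dict per position mapping base -> log2(frequency / background)."""
    pwm = []
    total = len(sites) + 4 * pseudo
    for pos in range(len(sites[0])):
        column = [s[pos].upper() for s in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudo) / total) / background)
            for base in "ACGT"
        })
    return pwm


def score_window(pwm, window):
    """Sum the per-position log-odds; unknown bases (e.g. N) contribute 0."""
    return sum(col.get(base, 0.0) for col, base in zip(pwm, window.upper()))


def sliding_scores(pwm, seq):
    """Score every window of PWM width along the sequence."""
    width = len(pwm)
    return [(i, score_window(pwm, seq[i:i + width]))
            for i in range(len(seq) - width + 1)]
```

Each window score can be used on its own with a threshold, or as one feature in the fixed-length vector handed to a classifier.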
I.2.2. Splice Site Prediction: High Imbalanced
Ratio
- The difficulty: high imbalanced ratio.
- Assume that all nucleotides have an equal chance
of appearing. Then, roughly speaking, the expected
number of GTs is > 3,000,000,000/16
- Given that human has about 25,000 genes and each
gene on average has 10 exons, the number of
actual donor sites is about 250,000.
- Thus, the imbalanced ratio is roughly
- (16 × 250,000) : 3,000,000,000 = 4 : 3,000 = 1 : 750!
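The arithmetic above can be reproduced in a few lines. This is only a restatement of the slide's back-of-the-envelope estimate; the function name is illustrative.

```python
def imbalance_ratio(genome_size, n_genes, exons_per_gene):
    """Expected GT dinucleotides per true donor site, assuming uniform bases."""
    expected_gt = genome_size / 16          # each dinucleotide has probability 1/16
    donor_sites = n_genes * exons_per_gene  # roughly one donor site per exon
    return expected_gt / donor_sites


ratio = imbalance_ratio(3_000_000_000, 25_000, 10)
print(f"about 1:{ratio:.0f}")
```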
I.2.2. Splice Site Prediction: Approaches to
dealing with high imbalanced ratio
- In bioinformatics
- Under-sample the majority class
- Splicing: predict splicing with known gene
boundaries. Imbalanced ratio reduced from 1:750 to
1:30, assuming that coding sequence occupies 2% of
our genome.
I.2.2. Splice Site Prediction: Better ways of
dealing with imbalanced data
- The basic idea
- First cluster the majority class; hopefully
there is a large collection of clusters that are
almost purely majority.
- If an unknown instance falls into one of these
clusters, then we say that the instance is a
negative instance.
I.2.2 Splice Site Prediction: Performance
Measure for Imbalanced Data
- If the imbalanced ratio is high, say 1:1000, then
simply saying negative (assuming that the majority
is negative) gives accuracy 99.9%!
- More meaningful measures are recall and precision
- Recall: proportion of positive (minority)
instances that are predicted (i.e. recalled) as
positive.
- Precision: proportion of instances predicted
positive that are actually positive.
- But recall and precision are competing goals,
so use the F-measure, the harmonic mean of recall
and precision, as a compromise measure.
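The point about accuracy can be made concrete with a short sketch. It computes accuracy, precision, recall, and the F-measure from label lists (1 = positive, 0 = negative) and applies them to an "always say negative" predictor at a 1:1000 ratio. The function names are illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def precision_recall_f(y_true, y_pred):
    """Return (precision, recall, F-measure) for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f_measure


# One true site among 1000 decoys, and a predictor that always says "negative".
y_true = [1] + [0] * 1000
always_negative = [0] * 1001
print(accuracy(y_true, always_negative))            # high, yet useless
print(precision_recall_f(y_true, always_negative))  # all zeros
```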
I.2.2. Splice Site Prediction: Kihoon's Splice
Site Result.
I.2.3 Gene Assembly
- There are three approaches
- Homology-Based
- Exon prediction-Based
- HMM Approach
- Combining other predictors
I.2.3. Gene Assembly: Inference by homology
- For exon finding, we need to find matches to:
- (1) ESTs (start from TSS)
- (2) SwissProt or other protein databases (start
from TIS)
- (3) Known exons
- Intronless: Easy, match all multiple-of-3-length
substrings from AUG to a stop codon
(UAG, UAA, UGA) against ESTs.
- For (2), we need to convert the predicted coding
sequence to protein sequences.
- For (3), certain organisms are particularly
useful. E.g., Tetraodon nigroviridis
(a pufferfish) is a vertebrate with very few introns.
(Figure labels: May need to backtrack; TSS)
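For the intronless case, the candidate coding sequences can be enumerated by scanning from each ATG to the first in-frame stop codon. The sketch below works on DNA (ATG, TAG/TAA/TGA) and uses exact substring lookup as a stand-in for a real EST alignment; that simplification and the function names are assumptions for illustration.

```python
STOP_CODONS = {"TAG", "TAA", "TGA"}


def find_orfs(dna, min_codons=1):
    """Return (start, end, sequence) for each ATG up to its first in-frame stop."""
    dna = dna.upper()
    orfs = []
    for start in range(len(dna) - 2):
        if dna[start:start + 3] != "ATG":
            continue
        for i in range(start + 3, len(dna) - 2, 3):
            if dna[i:i + 3] in STOP_CODONS:
                if (i - start) // 3 >= min_codons:   # codons before the stop
                    orfs.append((start, i + 3, dna[start:i + 3]))
                break
    return orfs


def est_supported(orfs, ests):
    """Keep ORFs found verbatim in some EST (a crude stand-in for alignment)."""
    ests = [e.upper() for e in ests]
    return [orf for orf in orfs if any(orf[2] in est for est in ests)]
```

Only the forward strand is scanned here; a fuller version would also scan the reverse complement and tolerate mismatches when comparing against ESTs.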
I.2.3. Gene Assembly: ExoFish
- Idea: coding regions are more conserved than
non-coding regions
- Uses sequences of Tetraodon nigroviridis:
- Genome is 8 times smaller than human
- Almost no repetitive sequence
- Relatively few introns.
- Most human genes have a homolog
- 70% complete sequence
I.2.3. Gene Assembly: EST_Genome
- ESTs (Expressed Sequence Tags) come from random
sequencing of fragments of mRNAs. They are not
complete coding sequences, but are all coding
sequences
- dbEST and other large databases exist (from
varying organisms). There are over 6 million
ESTs.
- Matching ESTs to genomic sequence gives important
clues about coding sequences
- Needs to handle introns gracefully when aligning
ESTs to the genome
- AAT, GeneSeqer, SIM4, Spidey: match ESTs
- Procrustes, GeneWise, ORFgene, ALN: match
proteins
- The matches serve as evidence.
I.2.3. Gene Assembly: Prediction-Based Method
- Instead of trying to match ESTs, build an exon
predictor.
- Splice site predictors generate candidate exons.
- Need to determine the real exons among the
candidate exons.
- Train a classifier. Convert exons in known genes
and non-exons (stretches that have AG in front and
GT behind but are not exons) to training
examples and train a classifier. Each exon is
assigned a confidence score.
- Use dynamic programming to assemble the exons
into genes.
- Examples: GeneId, GeneView, GAP3, FGENE, DAGGER.
- Advantage
- Disadvantage
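The dynamic programming step can be sketched as a chaining problem: given candidate exons with confidence scores, pick the non-overlapping set with the highest total score. This is a deliberately simplified sketch (weighted interval scheduling). It ignores reading-frame compatibility and intron length constraints, and it is not the algorithm of any of the named tools. Coordinates are half-open (end exclusive).

```python
from bisect import bisect_right


def assemble_exons(candidates):
    """candidates: list of (start, end, score). Returns (best_total, chain)."""
    exons = sorted(candidates, key=lambda e: e[1])   # sort by end coordinate
    ends = [e[1] for e in exons]
    best = [0.0] * (len(exons) + 1)     # best[k]: best total using the first k exons
    choice = [None] * (len(exons) + 1)  # predecessor index if exon k is taken

    for k, (start, end, score) in enumerate(exons, start=1):
        # j = number of earlier exons that end at or before this exon's start
        j = bisect_right(ends, start, 0, k - 1)
        take = best[j] + score
        if take > best[k - 1]:
            best[k], choice[k] = take, j
        else:
            best[k] = best[k - 1]

    # Trace back the chosen exons.
    chain, k = [], len(exons)
    while k > 0:
        if choice[k] is None:
            k -= 1
        else:
            chain.append(exons[k - 1])
            k = choice[k]
    chain.reverse()
    return best[-1], chain
```

A usage example with made-up candidates: `assemble_exons([(0, 100, 0.9), (50, 150, 0.5), (120, 200, 0.8)])` keeps the first and third exons and drops the overlapping middle one.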
I.2.3. Gene Assembly: Hidden Markov Model
- Instead of predicting splice sites and then trying
to match ESTs or using the predictor for
exon prediction, generate splice sites and
predict exons together using an HMM.
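To illustrate how an HMM labels a whole sequence at once, here is a toy two-state model decoded with the Viterbi algorithm in log space. The states, transition probabilities, and emission probabilities are invented for illustration (GC-rich "exon-like" versus AT-rich "intron-like"); real gene-finding HMMs have many more states for splice sites, frames, and UTRs.

```python
import math

STATES = ("E", "I")  # E = exon-like, I = intron-like (toy states, assumed)
START_P = {"E": 0.5, "I": 0.5}
TRANS_P = {"E": {"E": 0.9, "I": 0.1}, "I": {"E": 0.1, "I": 0.9}}
EMIT_P = {
    "E": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "I": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}


def viterbi(obs, states=STATES, start_p=START_P, trans_p=TRANS_P, emit_p=EMIT_P):
    """Most probable state path for a non-empty sequence over A, C, G, T."""
    scores = {s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}
    back = []
    for symbol in obs[1:]:
        new_scores, pointers = {}, {}
        for s in states:
            # Best previous state to arrive in s from.
            prev = max(states, key=lambda r: scores[r] + math.log(trans_p[r][s]))
            new_scores[s] = (scores[prev] + math.log(trans_p[prev][s])
                             + math.log(emit_p[s][symbol]))
            pointers[s] = prev
        scores = new_scores
        back.append(pointers)

    # Trace back from the best final state.
    state = max(states, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))
```

All probabilities in the toy tables are non-zero, so taking logarithms is safe; a model with zero entries would need special handling.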
I.2.3. Gene Assembly: Putting together multiple
predictions
- None of these methods is ideal, so...
- Best results come from agreement among multiple
approaches
- Computational integration. Many gene finders now
include EST or homology information
- E.g. FGenesH, GenomeScan
- Visualization environments for viewing multiple
predictions
- Genome browsers (UCSC annotation tracks!),
Genotator
I.2.3. Gene Assembly: GeneComber
- Simple, open source PERL/MySQL program finds
agreement between HMMGene and GenScan
- Accuracy is higher when they agree
- Common strategy for combining predictions
- Can also use fancier combining principles, e.g.
NNs
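The agreement idea can be sketched as an interval comparison between two predictors' exon lists. The following is an illustrative sketch only, not GeneComber's actual procedure: the reciprocal-overlap threshold and the function names are assumptions. Coordinates are half-open (end exclusive).

```python
def overlap(a, b):
    """Number of bases shared by two (start, end) intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def agreed_exons(pred_a, pred_b, min_fraction=0.9):
    """Exons where both predictors agree on at least min_fraction of the longer one."""
    agreed = []
    for a in pred_a:
        for b in pred_b:
            longest = max(a[1] - a[0], b[1] - b[0])
            if longest > 0 and overlap(a, b) / longest >= min_fraction:
                # Report the region that both predictions cover.
                agreed.append((max(a[0], b[0]), min(a[1], b[1])))
                break
    return agreed
```

Setting `min_fraction=1.0` keeps only exons on which the two predictors agree exactly; lower values trade precision for recall.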
I.3.1. Alternative Splicing
- Estimated 50% of human genes have alternative
splicing.
- Drosophila Dscam (Down syndrome cell adhesion
molecule) has > 38,000 isoforms! Present in human
on 21q22.
- Recent results have demonstrated the importance
of alternative splicing (e.g. exon skipping, use
of partial exons), particularly in mammalian
genomes
- Predictions of alternatively spliced gene
products are valuable
- Knowledge of alternative splice possibilities can
be used to improve gene prediction
I.3.1. Alternative Splicing: EuGÈNE
- Takes alignments of alternatively spliced
transcripts, and enumerates all possible paths
through exons that could generate the observations
- Uses dynamic programming to find a minimal set
I.3.1. Alternative Splicing: HMM path
distributions
- Cawley & Pachter, '05: used a sampled distribution
of paths through HMMGene to estimate
alternative splicing likelihoods
- Contrast with the Viterbi assumption!
I.3.1. Alternative Splicing: Predictions of
unobserved AS
- Chris Burge's UNCOVER: pair HMM (states cover two
sequences, saying either conserved or not)
- Looks at one human/mouse orthologous intron to
estimate whether it contains a splice
- Alternative conserved exons detected with high
accuracy (much better than BLASTing)
- 100s of novel predictions verified with PCR
I.3.2 RNA gene prediction
- Not all gene products are proteins! RNA genes
are significant, too.
- Hexamer frequency, etc. obviously not going to
work
- Generally, different RNA genes have quite
different characteristics, so there are different
approaches to each family, e.g.
- Bacterial RNA genes
- tRNAs
- miRNAs
I.3.2. RNA gene prediction: miRNA predictions
- Recent discoveries show growing importance of
micro RNAs in expression regulation
- Chris Burge (again!) with MiRScan uses SVMs
- Particularly tricky, since there is no known
negative set.
- Uses a set that is probably mostly negative, and
resampling to estimate a threshold that allows
some negative examples to become positive.
Seeded with known positives for testing
I.3.3. TIS Predictor: Previous result
- Almost always, starts with ATG.
- Some papers say "It begins with ATG so it is
trivial."
- Number of genes vs. number of ATGs: 1:3500.
- Most results are based on the Pedersen and Nielsen
data set, with an imbalanced ratio of 1:3.
I.3.3. Promoter Predictor: Previous Result
- Promoter / TSS prediction
- TSS prediction is essentially promoter
prediction. RNA polymerase needs to bind to the
promoter, which is close to the TSS.
- TSS is more difficult since there is no consensus
sequence like ATG. Imbalanced ratio of
25,000 : (3,000,000,000 - 25,000), about 1 : 120,000
- Other consensus sequences have been found
recently: "Clustering of DNA sequences in human
promoters".
- Typically place a window
- Recently, the window is reduced to -50 to +50,
with HMM. Result: recall and precision about
20-30%.
(Figure labels: 4000; TSS)
I.3.3. Promoter Prediction
- Some popular classes of consensus sequences
TBP is a subunit of TFIID, which in turn is part
of the RNA polymerase II preinitiation complex.
I.3.3. TIS/TSS/Promoter Predictors: Some ideas.
- Kozak sequence: (GCC)RCCATGG, where R is a purine
(A or G). Possibly other sequences.
- Kozak's result is fairly old (from the 80s); with
more genes known and better computational
techniques, there may be other sequences.
- Proposed project (find TIS, TSS, promoter):
- Use the repeat database (on /mnt/nas1) to focus
on the areas that may contain genes. Perhaps
start with Chromosome 22 first.
- Try to find non-positional consensus sequences as
features. (Kihoon is developing a generic program
for it)
- Possibility of linking the two predictors: if
there is no predicted TIS nearby then a promoter
cannot exist. Need to compute statistics on how
far the TIS is from the TSS (i.e. length of 5' UTR).
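As a starting point for the TIS part of the project, the Kozak consensus can be written as a pattern and scanned for. The sketch below is a minimal exact-match version, assuming the core RCCATGG is required and the leading GCC is optional (reported as a flag). Real Kozak contexts vary, so a PWM or a learned model would be more realistic. The function name is illustrative.

```python
import re

# Core Kozak context: R (A or G), CC, then ATG followed by G.
# The lookahead lets finditer report overlapping occurrences.
KOZAK_CORE = re.compile(r"(?=([AG]CCATGG))")


def kozak_atg_positions(dna):
    """Return (atg_index, has_gcc_prefix) for each exact Kozak-core match."""
    dna = dna.upper()
    hits = []
    for m in KOZAK_CORE.finditer(dna):
        core_start = m.start()
        atg_index = core_start + 3
        has_gcc = dna[max(0, core_start - 3):core_start] == "GCC"
        hits.append((atg_index, has_gcc))
    return hits
```

For example, `kozak_atg_positions("TTGCCACCATGGCATGTT")` reports the ATG at index 8 (with the GCC prefix present) and skips the second ATG, which lacks the Kozak context.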
I.3.3. TIS/TSS/Promoter Predictors: Some other
generalizations
- Similar ideas can be applied to predict the 3'
UTR. Somehow polyA tail prediction receives less
attention, but recently, from talking to UTHSCSA
people, it seems to be interesting.
- For promoter prediction, perhaps predict the
subclass, but need to do some background work to
see if the subclass is interesting.
- NOTE: you have to give me the background! Try to
meet me and read at the very least 3-4 of the latest
papers! October 18! Which is in 2 weeks' time! Need
to meet me twice, no kidding!
I.3.4. Beyond Gene Finding: Gene Ontology
Predictions
- Simply finding genes is not good enough. Need to
know their:
- Function: What do they do?
- Localization: Where do they do it?
- Process: How do they do it?
- To this end, the Gene Ontology Consortium has come
up with a vocabulary to describe gene
product attributes.
- Problems with current gene prediction
- Extremely high accuracy for a multiclass problem,
> 80%. Problem with training data and homology
search
- Perhaps an intermediate prediction is to predict
the signal peptide and what signals it provides.
I.3.4 Beyond Gene Finding: Regulatory Network
Networking
- Regulatory network: How genes control other genes
(mainly through the production of transcription
factors or subunits of transcription factors).
I.3.4. Beyond Gene Finding: Other Networks
- Metabolic pathway: How the cell takes in raw
material (sugar)
- Signal transduction: How the cell receives and
transmits signals
- Currently, we are only interested in the partial
regulatory network that involves regulation via
promoter control.
- Beyond this, tissue-to-tissue communication. Way
to go!