Title: Gone Fishing: An Introduction to Gene Finding Methods
1 Gone Fishing: An Introduction to Gene Finding Methods
Jarek Meller, Biomedical Informatics, CHRF
Additional materials for those who missed the Intro to Functional Genomics course
2 A couple of definitions
- A short history of genes: from hereditary basis for traits to "one gene, one polypeptide"
- Modern definition of the gene: a complete chromosomal segment responsible for making a functional product
- Codon: a triplet of nucleotides encoding an amino acid
- Open Reading Frame (ORF): a string of codons bounded by start and stop signals (codons)
- Pseudogene: a potential gene with an impaired ability to make a viable transcription (or translation) product
3 The Canonical Structure of Eukaryotic Genes
[Figure: 5' - Pr (TATA) - 5'UTR - E1 - I1 - E2 - I2 - E3 - 3'UTR (AATAAA) - polyA - 3']
Eukaryotic genes are in general neither contiguous nor continuous: coding regions are typically split into a number of coding fragments (exons), separated by non-coding intervening fragments known as introns.
4 Motifs and Processes
(motif: recognizing factor; process)
- TATA: TBP; transcription initiation
- AATAAA: poly-A polymerase; poly-A tail attachment (pre-mRNA processing)
- GT, AG: spliceosome complex; splicing (pre-mRNA processing)
- ATG: ribosome complex; translation initiation
- TGA, TAA, TAG: ribosome complex; translation termination
5 From Transcription to pre-RNA Processing to Splicing to Translation
6 Finding Genes in Prokaryotic Genomes
AAAATGGGGGTGGGTGATGAGAGACTTAGATGAATAA
f0 AAAATGGGGGTGGGTGATGAGAGACTTAGATGAATAa
f1 aAAATGGGGGTGGGTGATGAGAGACTTAGATGAATAA
f2 aaAATGGGGGTGGGTGATGAGAGACTTAGATGAATaa
- A simple algorithm: for each reading frame on both strands
- Find start (ATG) and stop (TGA, TAG, TAA) codons
- Find sufficiently long (threshold) Open Reading Frames (ORF)
- For each ORF compute a coding potential, e.g. using codon usage
- ORFs with sufficiently high score become candidate genes
- Refinements: alternative coding measures, homology, regulatory motifs
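The first steps of the simple algorithm above (scan all six frames, find start and stop codons, keep sufficiently long ORFs) might be sketched as follows. This is a minimal illustration, not a production gene finder: the function names and the `min_codons` threshold are assumptions made for this example, the first in-frame ATG is taken as the start, and the coding-potential scoring step is left out.

```python
START = "ATG"
STOPS = {"TGA", "TAG", "TAA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """Yield (strand, frame, start, end, orf) for each sufficiently long ORF.

    start/end are 0-based offsets on the scanned strand, end is exclusive
    and includes the stop codon. min_codons counts start and stop codons
    and is an illustrative threshold only.
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            orf_start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if orf_start is None and codon == START:
                    orf_start = i      # first in-frame ATG opens the ORF
                elif orf_start is not None and codon in STOPS:
                    if (i + 3 - orf_start) // 3 >= min_codons:
                        yield strand, frame, orf_start, i + 3, s[orf_start:i + 3]
                    orf_start = None   # keep scanning this frame


# The example sequence from this slide.
slide_seq = "AAAATGGGGGTGGGTGATGAGAGACTTAGATGAATAA"
orfs = list(find_orfs(slide_seq))
for orf in orfs:
    print(orf)
```

On real genomes the threshold would be much larger (typically hundreds of bases), and each reported ORF would then be passed to a coding measure for scoring.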
7 Finding Genes in Eukaryotic Genomes
AAAATGGGGGTGGGTGATGAGAGACTTAGATGAATAA
g1 aaaATGGGGgtgggtgatgagagACTTAGatgaataa
   Met Gly Thr
g2 aaaATGGGGGTGGgtgatgagagACTTAGATGAATAA
   Met Gly Val Asp Leu Asp Glu
A legal parse (candidate gene) must have a single
ORF spanning all coding regions from the start to
the stop codon.
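The legality condition can be checked mechanically. The sketch below assumes the convention used in the g1 and g2 examples (uppercase = coding, lowercase = non-coding); the helper names are made up for this illustration, and the check covers only the single-ORF condition, not splice-site motifs.

```python
BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def spliced_cds(parse):
    """Concatenate the uppercase (coding) characters of a marked-up parse."""
    return "".join(ch for ch in parse if ch.isupper())


def translate_if_legal(parse):
    """Return the one-letter protein string if the parse is a single ORF, else None."""
    cds = spliced_cds(parse)
    if len(cds) % 3 != 0 or not cds.startswith("ATG"):
        return None
    protein = "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds), 3))
    # Exactly one stop codon is allowed, and it must be the last codon.
    if protein.endswith("*") and protein.count("*") == 1:
        return protein
    return None


g1 = "aaaATGGGGgtgggtgatgagagACTTAGatgaataa"
g2 = "aaaATGGGGGTGGgtgatgagagACTTAGATGAATAA"
print(translate_if_legal(g1))  # MGT*
print(translate_if_legal(g2))  # MGVDLDE*
```

Both parses of the same genomic sequence are legal, which is exactly why additional evidence (signals, coding statistics) is needed to choose between them.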
8 Further complications
- Alternative splicing
- Alternative transcription initiation sites and start codons
- Overlapping (and embedded) genes
- Regulatory sites often separated by long intervening non-coding sequences
- Pseudogenes
9 Signals: a simple approach using Weight Matrices
GAGGTAAGC CAGGTCAGT TCGGTAATT ATGGTAACT TAGGTCATT
Further refinements: supervised machine learning approaches, e.g. neural networks (NN)
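As a rough illustration of the weight-matrix idea, the sketch below builds a log-odds matrix from the five aligned sites listed on this slide and scores candidate 9-mers with it. The pseudocount of 1 and the uniform background of 0.25 are assumptions made here to keep the example small; they are not taken from the slides, and five sites are far too few for a usable model.

```python
import math

# The five aligned 9-mers from this slide (G, T at 0-based positions 3 and 4).
SITES = ["GAGGTAAGC", "CAGGTCAGT", "TCGGTAATT", "ATGGTAACT", "TAGGTCATT"]
ALPHABET = "ACGT"


def build_weight_matrix(sites, pseudocount=1.0, background=0.25):
    """Return one dict per position mapping base -> log2 odds score."""
    total = len(sites) + pseudocount * len(ALPHABET)
    matrix = []
    for pos in range(len(sites[0])):
        column = [s[pos] for s in sites]
        matrix.append({
            b: math.log2((column.count(b) + pseudocount) / total / background)
            for b in ALPHABET
        })
    return matrix


def score_site(matrix, candidate):
    """Sum the position-specific log-odds scores of a candidate site."""
    return sum(col[base] for col, base in zip(matrix, candidate))


wm = build_weight_matrix(SITES)
print(score_site(wm, "CAGGTAAGT"))  # candidate containing the GT dinucleotide
print(score_site(wm, "CAGCCAAGT"))  # same candidate with GT replaced by CC
```

Because each position is scored independently, this corresponds to a 0-th order model; dependencies between neighbouring positions would need a higher-order model.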
10 Coding measures: a simple codon usage model
The decomposition of a sequence S into codons Ck is reading-frame dependent, and all reading frames are considered for prediction (that is, the maximum score over all reading frames is taken, using a sufficiently long sliding window). However, only the true reading frames are used to generate the probabilities of each codon (see Codon Usage Table) in the training set of true exons. The background probabilities, in turn, may be computed from all the sequences (including introns) in the training set, taking into account all the reading frames.
Refinement: use homology and splice alignments.
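One common way to turn the description above into a score is a log-likelihood ratio, summing log(P(Ck) / P0(Ck)) over the codons of a window, where P comes from in-frame training exons and P0 from the background. The sketch below assumes that form; the function names, the pseudocount and the toy sequences are all invented for illustration and are not taken from the slides.

```python
import math
from collections import Counter
from itertools import product

CODONS = ["".join(p) for p in product("ACGT", repeat=3)]


def codons_in_frame(seq, frame=0):
    """Split seq into complete codons, starting at offset frame."""
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def codon_probabilities(codon_lists, pseudocount=1.0):
    """Estimate codon probabilities from lists of codons, with pseudocounts."""
    counts = Counter(c for codons in codon_lists for c in codons)
    total = sum(counts[c] for c in CODONS) + pseudocount * len(CODONS)
    return {c: (counts[c] + pseudocount) / total for c in CODONS}


def train(exons_in_frame, all_sequences):
    """Coding model from in-frame exons; background from all three frames."""
    coding = codon_probabilities(codons_in_frame(e) for e in exons_in_frame)
    background = codon_probabilities(
        codons_in_frame(s, f) for s in all_sequences for f in range(3)
    )
    return coding, background


def window_score(window, coding, background):
    """Maximum over the three frames of the summed log-likelihood ratio."""
    return max(
        sum(math.log(coding[c] / background[c]) for c in codons_in_frame(window, f))
        for f in range(3)
    )


# Toy data, invented purely for illustration (not real exons).
toy_exons = ["ATGGCCGCCGCCTAA", "ATGGCCGGCGCCTGA"]
toy_all = toy_exons + ["TTTTATATATTTTAATATAT"]
coding, background = train(toy_exons, toy_all)
coding_like = window_score("GCCGCCGCCGCC", coding, background)
noncoding_like = window_score("TATATATTTTAT", coding, background)
print(coding_like, noncoding_like)
```

In practice the window would slide along the genomic sequence (a length of about 120 bases is used in the example on the next slide), producing a coding-potential profile rather than a single number.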
11 R. Guigo: sliding window of length 120 bp, human beta-globin
13 Combining Sites and Coding Statistics
- Variety of approaches proposed, e.g. MORGAN, FGENES, GeneID, GRAIL
- The dynamic programming framework: find the best legal parse up to position n, given the best scoring and consistent parses up to position n-1 (analogy to sequence alignment)
- Hidden Markov Model: statistical learning framework for gene finding
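The dynamic programming idea can be illustrated with a deliberately simplified version of the problem: given candidate exons with scores, choose the best-scoring set of non-overlapping exons. This is only a sketch under strong assumptions. The candidate coordinates and scores are hypothetical, and reading-frame and splice-site consistency between consecutive exons, which the programs named above do enforce, is ignored here.

```python
from bisect import bisect_right


def best_exon_chain(exons):
    """Pick the highest-scoring set of non-overlapping candidate exons.

    exons: list of (start, end, score) tuples, end exclusive.
    Returns (total_score, chain), with the chain ordered by position.
    """
    exons = sorted(exons, key=lambda e: e[1])   # order by end coordinate
    ends = [e[1] for e in exons]
    # best[i] = (score, chain) of the best solution using the first i exons
    best = [(0.0, [])]
    for i, (start, end, score) in enumerate(exons):
        # j = number of earlier exons ending at or before this exon's start
        j = bisect_right(ends, start, 0, i)
        take_score = best[j][0] + score
        if take_score > best[i][0]:
            best.append((take_score, best[j][1] + [(start, end, score)]))
        else:
            best.append(best[i])                # skipping this exon is better
    return best[-1]


# Hypothetical candidate exons: (start, end, score).
candidates = [(0, 10, 2.0), (5, 20, 5.0), (20, 30, 1.5), (12, 18, 1.0)]
total, chain = best_exon_chain(candidates)
print(total, chain)
```

As in sequence alignment, the best solution for a prefix is built from the best solutions of shorter prefixes, so each candidate exon is examined only once after sorting.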
14 Problems and assignments
- Use a eukaryotic genomic sequence from GenBank of length larger than 20 kb to estimate the frequency of putative donor and acceptor sites
- Use true splice sites in your sequence to derive 0-th and first order Markov models (weight matrices)
- Compare the results of the two models for false sites in the sequence
- Consider splice alignments into protein (cDNA) sequence databases as a method to detect coding sequences. What would be the role of the six reading frames in such an exercise?
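A possible starting point for the first exercise is sketched below. It simply counts GT (putative donor) and AG (putative acceptor) dinucleotides on one strand; the function name and the returned dictionary layout are assumptions made for this example, and the short slide sequence stands in for the 20 kb GenBank record the exercise asks for.

```python
def putative_site_frequency(seq):
    """Count GT (donor) and AG (acceptor) dinucleotides on one strand."""
    seq = seq.upper()
    n = max(len(seq) - 1, 1)  # number of dinucleotide positions
    donors = sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == "GT")
    acceptors = sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == "AG")
    return {
        "GT": donors,
        "AG": acceptors,
        "GT_freq": donors / n,
        "AG_freq": acceptors / n,
    }


# Stand-in for a real genomic sequence: the example used on earlier slides.
stats = putative_site_frequency("AAAATGGGGGTGGGTGATGAGAGACTTAGATGAATAA")
print(stats)
```

Comparing these counts with the number of true splice sites in the annotated record gives a feel for how many false sites a signal model has to reject.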