Gone Fishing: An Introduction to Gene Finding Methods

Slides: 15

Provided by: foldin

Transcript and Presenter's Notes

1
Gone Fishing: An Introduction to Gene Finding Methods
Jarek Meller, Biomedical Informatics, CHRF
Additional materials for those who missed the Intro to Functional Genomics course
2
A couple of definitions

A short history of genes: from the hereditary basis for traits to "one gene, one polypeptide"
Modern definition of the gene: a complete chromosomal segment responsible for making a functional product
Codon: a triplet of nucleotides encoding an amino acid
Open Reading Frame (ORF): a string of codons bounded by start and stop signals (codons)
Pseudogene: a potential gene with an impaired ability to make a viable transcription (or translation) product

3
The Canonical Structure of Eukaryotic Genes
5' - Pr - 5'UTR - E1 - I1 - E2 - I2 - E3 - 3'UTR - polyA - 3'
Signals marked on the diagram: TATA, AATAAA
Eukaryotic genes are, in general, neither contiguous nor continuous: coding regions are typically split into a number of coding fragments (exons), separated by non-coding intervening fragments known as introns.
4
Motifs and Processes
Motif            Complex               Process
TATA             TBP                   transcription initiation
AATAAA           poly-A polymerase     poly-A tail attachment (pre-mRNA processing)
GT ... AG        spliceosome complex   splicing (pre-mRNA processing)
ATG              ribosome complex      translation initiation
TGA, TAA, TAG    ribosome complex      translation termination
5
From Transcription to pre-mRNA Processing to Splicing to Translation
6
Finding Genes in Prokaryotic Genomes
AAAATGGGGGTGGGTGATGAGAGACTTAGATGAATAA
f0 AAAATGGGGGTGGGTGATGAGAGACTTAGATGAATAa
f1 aAAATGGGGGTGGGTGATGAGAGACTTAGATGAATAA
f2 aaAATGGGGGTGGGTGATGAGAGACTTAGATGAATaa

A simple algorithm, applied to each reading frame on both strands:
Find start (ATG) and stop (TGA, TAG, TAA) codons
Find sufficiently long (above a threshold) Open Reading Frames (ORFs)
For each ORF, compute a coding potential, e.g. using codon usage
ORFs with a sufficiently high score become candidate genes
Refinements: alternative coding measures, homology, regulatory motifs
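
The first two steps of the algorithm above can be sketched in Python as follows. This is a minimal illustration, not the course's own code: the function names, the `min_codons` threshold, and the choice to start each ORF at the first ATG after the previous stop are all assumptions made for the example. Coding-potential scoring is left out.

```python
STOPS = {"TGA", "TAG", "TAA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """Scan all six reading frames for ATG...stop ORFs.

    Returns (strand, frame, start, end, orf) tuples. Coordinates are
    0-based, half-open, and relative to the strand being scanned.
    min_codons is a hypothetical length threshold (stop codon included).
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first start codon after the previous stop
                elif start is not None and codon in STOPS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3, s[start:i + 3]))
                    start = None
    return orfs


if __name__ == "__main__":
    for orf in find_orfs("AAAATGGGGGTGGGTGATGAGAGACTTAGATGAATAA"):
        print(orf)
```

On the example sequence from this slide, the forward strand yields one ORF in frame f0 and one in frame f1. Frame f2 contains an ATG but no downstream in-frame stop codon, so it contributes nothing.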

7
Finding Genes in Eukaryotic Genomes
AAAATGGGGGTGGGTGATGAGAGACTTAGATGAATAA
g1 aaaATGGGGgtgggtgatgagagACTTAGatgaataa
   Met Gly Thr
g2 aaaATGGGGGTGGgtgatgagagACTTAGATGAATAA
   Met Gly Val Asp Leu Asp Glu
A legal parse (candidate gene) must have a single
ORF spanning all coding regions from the start to
the stop codon.
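
The legality condition can be checked mechanically. The sketch below assumes the convention used in the g1 and g2 examples, where upper-case letters mark coding bases and lower-case letters mark non-coding bases. The function names are hypothetical. It tests only the single-ORF condition and does not verify the GT ... AG intron boundaries.

```python
STOPS = {"TGA", "TAG", "TAA"}


def coding_sequence(parse):
    """Concatenate the upper-case (coding) bases of a parse."""
    return "".join(ch for ch in parse if ch.isupper())


def is_legal_parse(parse):
    """True if the spliced coding bases form one ORF: ATG first,
    a stop codon last, and no in-frame stop codon in between."""
    cds = coding_sequence(parse)
    if len(cds) < 6 or len(cds) % 3 != 0:
        return False
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    return (
        codons[0] == "ATG"
        and codons[-1] in STOPS
        and not any(c in STOPS for c in codons[1:-1])
    )


if __name__ == "__main__":
    g1 = "aaaATGGGGgtgggtgatgagagACTTAGatgaataa"
    g2 = "aaaATGGGGGTGGgtgatgagagACTTAGATGAATAA"
    print(is_legal_parse(g1), is_legal_parse(g2))
```

Both parses from the slide pass the check. Moving a single base across an exon boundary breaks the frame, and the parse is rejected.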
8
Further complications

Alternative splicing
Alternative transcription initiation sites and
start codons
Overlapping (and embedded) genes
Regulatory sites often separated by long
intervening non-coding sequences
Pseudogenes

9
Signals: a simple approach using weight matrices
GAGGTAAGC
CAGGTCAGT
TCGGTAATT
ATGGTAACT
TAGGTCATT
Further refinements: supervised machine learning approaches, e.g. neural networks (NN)
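
As a rough illustration, a weight matrix can be built from the five example sites above. The pseudocount, the uniform background of 0.25 per base, and the log-odds scoring are assumptions made for this sketch; the slide does not specify them. The function names are hypothetical.

```python
import math

# The five aligned example sites from the slide.
SITES = ["GAGGTAAGC", "CAGGTCAGT", "TCGGTAATT", "ATGGTAACT", "TAGGTCATT"]


def build_weight_matrix(sites, pseudocount=1.0, background=0.25):
    """Return one dict per position mapping base -> log2-odds weight.

    pseudocount and background are assumed values, not taken from the slides.
    """
    matrix = []
    for pos in range(len(sites[0])):
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        matrix.append(
            {base: math.log2(counts[base] / total / background) for base in "ACGT"}
        )
    return matrix


def score(matrix, window):
    """Sum the position-specific weights for a window of matching length."""
    return sum(column[base] for column, base in zip(matrix, window))


if __name__ == "__main__":
    wm = build_weight_matrix(SITES)
    print(score(wm, "GAGGTAAGC"), score(wm, "GAGCCAAGC"))
```

Positions 3 to 5 of every site read GGT, so those columns carry the largest positive weights for the observed base. A window lacking the GT at positions 4 and 5 scores markedly lower.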
10
Coding measures: a simple codon usage model
The decomposition of a sequence S into codons C_k is reading-frame dependent, and all reading frames are considered for prediction (that is, the maximum score over all reading frames is taken, using a sufficiently long sliding window). However, only the true reading frames are used to generate the probabilities of each codon (see Codon Usage Table) in the training set of true exons. The background probabilities, in turn, may be computed from all the sequences (including introns) in the training set, taking into account all the reading frames.
Refinement: use homology and spliced alignments
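
One way to turn this description into code is a log-likelihood ratio of codon probabilities, maximized over the three frames of a window. The transcript does not preserve the slide's formula, so the scoring function, the pseudocount, and all names here are assumptions. The training exons are assumed to be supplied already in frame.

```python
import math
from itertools import product

CODONS = ["".join(p) for p in product("ACGT", repeat=3)]


def codon_probs(seqs, frames=(0,), pseudocount=1.0):
    """Estimate codon probabilities from sequences read in the given frames."""
    counts = dict.fromkeys(CODONS, pseudocount)
    for seq in seqs:
        seq = seq.upper()
        for frame in frames:
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if codon in counts:  # skip codons with ambiguous bases such as N
                    counts[codon] += 1
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


def coding_score(window, coding, background):
    """Maximum over the three frames of sum_k log(P_coding(C_k) / P_background(C_k))."""
    window = window.upper()
    best = -math.inf
    for frame in range(3):
        total = 0.0
        for i in range(frame, len(window) - 2, 3):
            codon = window[i:i + 3]
            if codon in coding:
                total += math.log(coding[codon] / background[codon])
        best = max(best, total)
    return best


if __name__ == "__main__":
    exons = ["GCTGCTGCTGCT"]              # toy "true exons", in frame
    everything = exons + ["TTTTTTTTTTTT"]  # toy set including non-coding sequence
    coding = codon_probs(exons)                       # true frame only
    background = codon_probs(everything, (0, 1, 2))   # all reading frames
    print(coding_score("GCTGCTGCT", coding, background))
```

A sliding-window scan would call `coding_score` on successive windows along the sequence and report positions where the score stays high.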
11
R. Guigo, sliding window of length 120 bp, human beta-globin
13
Combining Sites and Coding Statistics

A variety of approaches have been proposed, e.g. MORGAN, FGENES, GeneID, GRAIL
The dynamic programming framework: find the best legal parse up to position n, given the best-scoring, consistent parses up to position n-1 (analogy to sequence alignment)
Hidden Markov Models: a statistical learning framework for gene finding
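
The dynamic programming idea can be illustrated with a deliberately simplified sketch: given candidate exons with scores, pick the best-scoring chain of non-overlapping exons. This is not the algorithm of any of the programs named above. It ignores reading-frame consistency and splice-site compatibility, which a real gene finder must enforce. The exon tuples and function name are hypothetical.

```python
def best_exon_chain(length, exons):
    """Best-scoring set of non-overlapping candidate exons.

    exons: list of (start, end, score) with 0-based, half-open
    coordinates and start < end <= length.
    best[n] holds the best total score using only positions [0, n).
    """
    ending_at = {}
    for exon in exons:
        ending_at.setdefault(exon[1], []).append(exon)

    best = [0.0] * (length + 1)
    choice = [None] * (length + 1)
    for n in range(1, length + 1):
        best[n] = best[n - 1]  # option 1: position n-1 is non-coding
        for exon in ending_at.get(n, []):
            start, _end, exon_score = exon
            candidate = best[start] + exon_score  # option 2: an exon ends here
            if candidate > best[n]:
                best[n] = candidate
                choice[n] = exon

    # Trace back from the end of the sequence to recover the chosen exons.
    chain = []
    n = length
    while n > 0:
        if choice[n] is None:
            n -= 1
        else:
            chain.append(choice[n])
            n = choice[n][0]
    chain.reverse()
    return best[length], chain


if __name__ == "__main__":
    candidates = [(0, 10, 2.0), (5, 15, 3.0), (12, 20, 2.5), (16, 25, 1.0)]
    print(best_exon_chain(30, candidates))
```

As in sequence alignment, each cell depends only on earlier cells, so the best parse is found in a single left-to-right pass followed by a traceback.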

14
Problems and assignments

Use a eukaryotic genomic sequence from GenBank, longer than 20 kb, to estimate the frequency of putative donor and acceptor sites
Use the true splice sites in your sequence to derive zeroth- and first-order Markov models (weight matrices)
Compare the results of the two models on false sites in the sequence
Consider spliced alignments against protein (cDNA) sequence databases as a method to detect coding sequences. What would be the role of the six reading frames in such an exercise?
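
As a possible starting point for the first assignment, the sketch below counts putative donor (GT) and acceptor (AG) dinucleotides on one strand of a sequence held in a plain string. Loading the GenBank record is left out, and the function name and returned fields are hypothetical.

```python
def count_putative_sites(seq):
    """Locate every GT (putative donor) and AG (putative acceptor) dinucleotide.

    Frequencies are per dinucleotide position on the given strand.
    """
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    positions = max(len(seq) - 1, 1)
    return {
        "donor_positions": donors,
        "acceptor_positions": acceptors,
        "donor_freq": len(donors) / positions,
        "acceptor_freq": len(acceptors) / positions,
    }


if __name__ == "__main__":
    result = count_putative_sites("AAAATGGGGGTGGGTGATGAGAGACTTAGATGAATAA")
    print(result["donor_positions"], result["acceptor_positions"])
```

Comparing these raw counts with the number of annotated splice sites shows how many false sites a splice-site model has to reject.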