Gone Fishing: An Introduction to Gene Finding Methods

1
Gone Fishing: An Introduction to Gene Finding
Methods
Jarek Meller, Biomedical Informatics, CHRF
Additional materials for those who missed the
Intro to Functional Genomics course
2
A couple of definitions
  • A short history of genes: from the hereditary
    basis for traits to "one gene, one polypeptide"
  • Modern definition of the gene: a complete
    chromosomal segment responsible for making a
    functional product
  • Codon: a triplet of nucleotides encoding an
    amino acid
  • Open Reading Frame (ORF): a string of codons
    bounded by start and stop signals (codons)
  • Pseudogene: a potential gene with an impaired
    ability to make a viable transcription (or
    translation) product

3
The Canonical Structure of Eukaryotic Genes
5' - Pr (...TATA...) - 5'UTR - E1 - I1 - E2 - I2 -
E3 - 3'UTR - polyA (...AATAAA...) - 3'
Eukaryotic genes are in general neither
contiguous nor continuous: coding regions are
typically split into a number of coding fragments
(exons), separated by non-coding intervening
fragments known as introns.
4
Motifs and Processes
Motif           Factor / complex      Process
TATA            TBP                   transcription initiation
AATAAA          poly-A polymerase     poly-A tail attachment (pre-mRNA processing)
GT ... AG       spliceosome complex   splicing (pre-mRNA processing)
ATG             ribosome complex      translation initiation
TGA, TAA, TAG   ribosome complex      translation termination
5
From Transcription to pre-mRNA Processing
to Splicing to Translation
6
Finding Genes in Prokaryotic Genomes
AAAATGGGGGTGGGTGATGAGAGACTTAGATGAATAA
f0 AAAATGGGGGTGGGTGATGAGAGACTTAGATGAATAa
f1 aAAATGGGGGTGGGTGATGAGAGACTTAGATGAATAA
f2 aaAATGGGGGTGGGTGATGAGAGACTTAGATGAATaa
  • A simple algorithm: for each reading frame on
    both strands
  • Find start (ATG) and stop (TGA, TAG, TAA) codons
  • Find sufficiently long (above a threshold) Open
    Reading Frames (ORFs)
  • For each ORF, compute a coding potential, e.g.
    using codon usage
  • ORFs with a sufficiently high score become
    candidate genes
  • Refinements: alternative coding measures,
    homology, regulatory motifs
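The first steps of the algorithm above (scan every reading frame on both strands, pair ATG with the next in-frame stop, keep ORFs above a length threshold) can be sketched as follows. This is a minimal illustration, not a production gene finder: the names `find_orfs`, `reverse_complement` and `min_codons` are hypothetical, the threshold value is arbitrary, and the coding-potential scoring step is left out.

```python
STOPS = {"TGA", "TAG", "TAA"}

def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_orfs(seq, min_codons=3):
    """Return ORFs as (strand, frame, start, end, orf_sequence).

    start/end are 0-based offsets on the scanned strand (end exclusive).
    min_codons is a hypothetical length threshold that counts the start
    and the stop codon as well.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # position of the currently open ATG, if any
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOPS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3, s[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return orfs

if __name__ == "__main__":
    for orf in find_orfs("AAAATGGGGGTGGGTGATGAGAGACTTAGATGAATAA"):
        print(orf)
```

On the example sequence from this slide, the sketch reports two ORFs on the forward strand (one in f0 and one in f1) and none on the reverse strand. Frame f2 contains an ATG but no downstream in-frame stop, so it yields no ORF.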

7
Finding Genes in Eukaryotic Genomes
AAAATGGGGGTGGGTGATGAGAGACTTAGATGAATAA
g1 aaaATGGGGgtgggtgatgagagACTTAGatgaataa
   Met Gly Thr
g2 aaaATGGGGGTGGgtgatgagagACTTAGATGAATAA
   Met Gly Val Asp Leu Asp Glu
A legal parse (candidate gene) must have a single
ORF spanning all coding regions from the start to
the stop codon.
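The legality condition can be checked mechanically. The sketch below assumes the notation used in the g1 and g2 examples (upper case for coding regions, lower case for everything else) and the standard genetic code; the function names are hypothetical. It joins the coding segments and tests whether the result is a single ORF from ATG to the first stop codon.

```python
# Standard genetic code, built from the usual TCAG-ordered amino acid string.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def spliced_cds(parse):
    """Join the upper-case (coding) letters of a parse; drop lower-case ones."""
    return "".join(ch for ch in parse if ch.isupper())

def translate(cds):
    """Translate a coding sequence codon by codon ('*' marks a stop)."""
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds) - 2, 3))

def is_legal_parse(parse):
    """True if the joined coding regions form one ORF: ATG ... first stop."""
    cds = spliced_cds(parse)
    if len(cds) < 6 or len(cds) % 3 != 0:
        return False
    protein = translate(cds)
    return cds.startswith("ATG") and protein.endswith("*") and "*" not in protein[:-1]

if __name__ == "__main__":
    for name, parse in [("g1", "aaaATGGGGgtgggtgatgagagACTTAGatgaataa"),
                        ("g2", "aaaATGGGGGTGGgtgatgagagACTTAGATGAATAA")]:
        print(name, translate(spliced_cds(parse)), is_legal_parse(parse))
```

Both parses from the slide pass the check: g1 translates to `MGT*` and g2 to `MGVDLDE*`. Shifting an exon boundary by one base breaks the reading frame, and the parse is rejected.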
8
Further complications
  • Alternative splicing
  • Alternative transcription initiation sites and
    start codons
  • Overlapping (and embedded) genes
  • Regulatory sites often separated by long
    intervening non-coding sequences
  • Pseudogenes

9
Signals: a simple approach using weight
matrices
GAGGTAAGC CAGGTCAGT TCGGTAATT ATGGTAACT TAGGTCATT
Further refinements: supervised machine learning
approaches, e.g. neural networks (NN)
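As an illustration, a weight matrix can be derived from the five aligned sites listed on this slide. The sketch below makes several assumptions that are not in the original: scores are log-odds (base 2) against a uniform background of 0.25, a pseudocount of 1 avoids taking the log of zero, and all function names are hypothetical.

```python
import math

# The five aligned sites from the slide (GT at positions 4-5, 1-based).
DONOR_SITES = ["GAGGTAAGC", "CAGGTCAGT", "TCGGTAATT", "ATGGTAACT", "TAGGTCATT"]

def build_weight_matrix(sites, pseudocount=1.0, background=0.25):
    """One dict per position: base -> log2(P(base at position) / background)."""
    total = len(sites) + 4 * pseudocount
    matrix = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        matrix.append({
            base: math.log2((column.count(base) + pseudocount) / total / background)
            for base in "ACGT"
        })
    return matrix

def score_site(matrix, site):
    """Sum of per-position log-odds; positions are treated as independent."""
    return sum(column[base] for column, base in zip(matrix, site))

def scan(matrix, seq):
    """Score every window of the matrix width along seq: [(offset, score), ...]."""
    width = len(matrix)
    return [(i, score_site(matrix, seq[i:i + width]))
            for i in range(len(seq) - width + 1)]

if __name__ == "__main__":
    wm = build_weight_matrix(DONOR_SITES)
    print(score_site(wm, "GAGGTAAGC"), score_site(wm, "CCCCCCCCC"))
```

Conserved columns (the invariant G and T) contribute the largest positive scores, so windows that lack the GT core are penalized heavily. Five sites are far too few for a usable model; they only make the mechanics visible.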
10
Coding measures: a simple codon usage model
The decomposition of a sequence S into codons Ck
is reading-frame dependent, and all reading frames
are considered for prediction (that is, the maximum
score over all reading frames, with a sufficiently
long sliding window, is taken). However, only the
true reading frames are used to generate the
probabilities of each codon (see the Codon Usage
Table) in the training set of true exons. The
background probabilities, in turn, may be computed
from all the sequences (including introns) in the
training set, taking into account all the reading
frames.
Refinement: use homology and splice alignments
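The scoring formula itself is not in this transcript, so the sketch below assumes a common form: the score of a window is the sum over its codons of log(P(Ck) / P0(Ck)), maximized over the three reading frames. Coding probabilities P come from in-frame codons of true exons, background probabilities P0 from all frames of all training sequences. The pseudocount, function names and toy training data are hypothetical.

```python
import math
from collections import Counter
from itertools import product

CODONS = ["".join(p) for p in product("ACGT", repeat=3)]

def codons_in_frame(seq, frame=0):
    """Split seq into consecutive codons starting at offset `frame`."""
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]

def codon_probs(codon_lists, pseudocount=1.0):
    """Codon probabilities with a pseudocount so no codon has probability 0."""
    counts = Counter(c for codons in codon_lists for c in codons)
    total = sum(counts[c] for c in CODONS) + pseudocount * len(CODONS)
    return {c: (counts[c] + pseudocount) / total for c in CODONS}

def train(exons_in_frame, all_sequences):
    """Coding model: true frame of true exons. Background: all frames, all sequences."""
    coding = codon_probs(codons_in_frame(e, 0) for e in exons_in_frame)
    background = codon_probs(
        codons_in_frame(s, f) for s in all_sequences for f in range(3))
    return coding, background

def window_score(window, coding, background):
    """Return (best log-likelihood ratio, frame) over the three reading frames."""
    return max(
        (sum(math.log(coding[c] / background[c])
             for c in codons_in_frame(window, f)), f)
        for f in range(3))

def sliding_scores(seq, coding, background, window=120, step=3):
    """Score successive windows along seq: [(offset, score, frame), ...]."""
    return [(i, *window_score(seq[i:i + window], coding, background))
            for i in range(0, len(seq) - window + 1, step)]

if __name__ == "__main__":
    exons = ["GCTGCTGCTGCTGCT"]               # toy "true exon", given in frame
    everything = exons + ["ATATATATATATATATAT"]  # toy non-coding sequence
    coding, background = train(exons, everything)
    print(window_score("AGCTGCTGCT", coding, background))
```

With this toy training set, the window `AGCTGCTGCT` scores best in frame 1, where it reads GCT GCT GCT. The sketch assumes sequences over A, C, G and T only; ambiguity codes such as N would need separate handling.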
11
R. Guigó: sliding window of length 120 bp, human
beta-globin
13
Combining Sites and Coding Statistics
  • Variety of approaches proposed, e.g. MORGAN,
    FGENES, GeneID, GRAIL
  • The dynamic programming framework: find the best
    legal parse up to position n, given the
    best-scoring consistent parses up to position
    n-1 (analogy to sequence alignment)
  • Hidden Markov Models: a statistical learning
    framework for gene finding
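The dynamic programming idea can be illustrated with a deliberately simplified exon-chaining sketch. It is not the method of any of the programs named above. It assumes that candidate exons have already been scored (for example by combining site and coding scores), and it only requires exons in a chain not to overlap. Frame compatibility and splice-site consistency, which real gene finders enforce, are ignored. All names and the toy scores are hypothetical.

```python
def best_parse(exons):
    """Best-scoring chain of non-overlapping candidate exons.

    exons: list of (start, end, score), end exclusive.
    Returns (total_score, chain). The best chain ending in each exon is
    built from the best chains ending in earlier, compatible exons.
    """
    exons = sorted(exons, key=lambda e: e[1])
    best = []  # best[i] = (score, chain) of the best chain ending with exons[i]
    for i, (start, end, score) in enumerate(exons):
        prev_score, prev_chain = 0.0, []
        for j in range(i):
            # exons[j] is compatible if it ends before exons[i] starts
            if exons[j][1] <= start and best[j][0] > prev_score:
                prev_score, prev_chain = best[j]
        best.append((prev_score + score, prev_chain + [exons[i]]))
    return max(best, key=lambda b: b[0], default=(0.0, []))

if __name__ == "__main__":
    candidates = [(0, 10, 5.0), (5, 20, 6.0), (12, 30, 4.0), (25, 40, 3.5)]
    print(best_parse(candidates))
```

For the toy candidates, the best chain is (5, 20) followed by (25, 40), with total score 9.5. The recurrence has the same shape as in sequence alignment: the optimum up to a given point is assembled from previously computed optima.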

14
Problems and assignments
  • Use a eukaryotic genomic sequence from GenBank,
    longer than 20 kb, to estimate the frequency of
    putative donor and acceptor sites
  • Use the true splice sites in your sequence to
    derive zeroth- and first-order Markov models
    (weight matrices)
  • Compare the results of the two models for false
    sites in the sequence
  • Consider splice alignments against protein (cDNA)
    sequence databases as a method to detect coding
    sequences. What would be the role of the six
    reading frames in such an exercise?
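As a possible starting point for the first assignment, the sketch below counts putative donor (GT) and acceptor (AG) dinucleotides in a sequence. It looks at the forward strand only and uses the short example sequence from the earlier slides instead of a 20 kb GenBank entry. The function name and the returned keys are hypothetical.

```python
def putative_site_frequencies(seq):
    """Count GT (putative donor) and AG (putative acceptor) dinucleotides.

    Frequencies are relative to the number of dinucleotide positions on
    the forward strand; the reverse strand is not examined here.
    """
    seq = seq.upper()
    positions = max(len(seq) - 1, 0)
    donors = sum(1 for i in range(positions) if seq[i:i + 2] == "GT")
    acceptors = sum(1 for i in range(positions) if seq[i:i + 2] == "AG")
    return {
        "donor_GT": donors,
        "acceptor_AG": acceptors,
        "donor_freq": donors / positions if positions else 0.0,
        "acceptor_freq": acceptors / positions if positions else 0.0,
    }

if __name__ == "__main__":
    print(putative_site_frequencies("AAAATGGGGGTGGGTGATGAGAGACTTAGATGAATAA"))
```

The short example contains two GT and three AG dinucleotides among 36 positions. In a long sequence with uniform base composition, each dinucleotide would be expected at about 1 position in 16, which is why putative sites vastly outnumber true splice sites.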