Eukaryotic Gene Finding - PowerPoint PPT Presentation

Provided by: iftachn
1
Eukaryotic Gene Finding
Adapted in part from http://online.itp.ucsb.edu/online/infobio01/burge/
2
Prokaryotic vs. Eukaryotic Genes
  • Prokaryotes
    • small genomes
    • high gene density
    • no introns (or splicing)
    • no RNA processing
    • similar promoters
    • overlapping genes
  • Eukaryotes
    • large genomes
    • low gene density
    • introns (splicing)
    • RNA processing
    • heterogeneous promoters
    • polyadenylation

3
(No Transcript)
4
Pre-mRNA Splicing
exon definition
intron definition
...
(assembly of
spliceosome,
catalysis)
...
5
(No Transcript)
6
Some Statistics
  • On average, a vertebrate gene is about 30 kb long
  • The coding region takes up about 1 kb
  • Exon sizes vary from tens of base pairs to
    kilobases
  • An average 5' UTR is about 750 bp
  • An average 3' UTR is about 450 bp, but both can be
    much longer

7
Human Splice Signal Motifs
5' splice signal
3' splice signal
8
(No Transcript)
9
(No Transcript)
10
Semi-Markov HMM Model
11
Genscan HSMM
12
GenScan States
  • N - intergenic region
  • P - promoter
  • F - 5' untranslated region
  • Esngl - single exon (intronless) (translation
    start -> stop codon)
  • Einit - initial exon (translation start -> donor
    splice site)
  • Ek - phase k internal exon (acceptor splice site
    -> donor splice site)
  • Eterm - terminal exon (acceptor splice site ->
    stop codon)
  • Ik - phase k intron: 0 - between codons; 1 -
    after the first base of a codon; 2 - after the
    second base of a codon
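
To make the phase definition concrete, here is a minimal Python sketch that derives the phase of each intron from the coding lengths of the exons before it. The function name and the example exon lengths are hypothetical and are not taken from GenScan.

```python
def intron_phases(coding_exon_lengths):
    """Return the phase (0, 1 or 2) of the intron that follows each exon.

    coding_exon_lengths: number of coding bases in each exon, in
    transcript order. The last exon has no intron after it.
    """
    phases = []
    coding_bases_so_far = 0
    for length in coding_exon_lengths[:-1]:
        coding_bases_so_far += length
        # 0: intron falls between codons
        # 1: intron falls after the first base of a codon
        # 2: intron falls after the second base of a codon
        phases.append(coding_bases_so_far % 3)
    return phases


# Hypothetical gene with three coding exons of 100, 50 and 30 bases.
example_phases = intron_phases([100, 50, 30])
```

In this sketch the phase depends only on the running total of coding bases modulo 3, which is why the exon that follows a phase k intron must resume the codon at the matching position.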

13
GenScan features
  • Model both strands at once
  • Each state may output a string of symbols
    (according to some probability distribution).
  • Explicit intron/exon length modeling
  • Advanced splice site modeling
  • Parameters learned from annotated genes
  • Separate parameter training for different CpG
    content groups
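
The idea that each state outputs a whole string, with its own length distribution, can be sketched as a toy generator. This is only an illustration of the semi-Markov idea: the two states, the length distributions, and the base frequencies below are made-up assumptions, not GenScan's trained parameters.

```python
import random

# Hypothetical toy parameters, chosen only for illustration.
LENGTHS = {
    # Each state draws its segment length from its own distribution.
    "exon": lambda rng: rng.randint(50, 200),
    "intron": lambda rng: 60 + int(rng.expovariate(1 / 500)),
}
EMISSIONS = {
    "exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}
NEXT_STATE = {"exon": "intron", "intron": "exon"}


def generate(n_segments, seed=0):
    """Generate (state, sequence) segments from the toy semi-Markov model."""
    rng = random.Random(seed)
    state = "exon"
    segments = []
    for _ in range(n_segments):
        # 1. Draw a duration for the current state.
        length = LENGTHS[state](rng)
        # 2. Emit a string of that length from the state's base frequencies.
        symbols, weights = zip(*EMISSIONS[state].items())
        sequence = "".join(rng.choices(symbols, weights=weights, k=length))
        segments.append((state, sequence))
        # 3. Move to the next state (deterministic in this toy model).
        state = NEXT_STATE[state]
    return segments


segments = generate(4)
```

In an ordinary HMM, a state that loops on itself has an implicitly geometric length distribution. Drawing the length explicitly, as in step 1, is what lets a model of this kind fit observed exon and intron lengths.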

14
(No Transcript)
15
GenScan Signal Modeling
  • PSSM: P(S) = P1(S1) P2(S2) ... Pn(Sn)
    • PolyA signal
    • Translation initiation/termination signal
    • Promoters
  • WAM: P(S) = P1(S1) P2(S2|S1) ... Pn(Sn|Sn-1)
    • 5' and 3' splice sites
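
The two formulas above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function names, the pseudocount smoothing, and the three training sites are hypothetical and do not reproduce GenScan's actual training procedure.

```python
import math
from collections import Counter, defaultdict

ALPHABET = "ACGT"


def train_pssm(sites, pseudocount=1.0):
    """Estimate independent per-position base probabilities P_i(b)."""
    total = len(sites) + pseudocount * len(ALPHABET)
    model = []
    for i in range(len(sites[0])):
        counts = Counter(site[i] for site in sites)
        model.append({b: (counts[b] + pseudocount) / total for b in ALPHABET})
    return model


def pssm_log_prob(model, s):
    """log P(S) = sum_i log P_i(S_i)."""
    return sum(math.log(model[i][b]) for i, b in enumerate(s))


def train_wam(sites, pseudocount=1.0):
    """Estimate P_1(b) and the conditionals P_i(b | previous base)."""
    first = train_pssm([site[0] for site in sites], pseudocount)[0]
    conditionals = []
    for i in range(1, len(sites[0])):
        counts = defaultdict(Counter)
        for site in sites:
            counts[site[i - 1]][site[i]] += 1
        table = {}
        for prev in ALPHABET:
            total = sum(counts[prev].values()) + pseudocount * len(ALPHABET)
            table[prev] = {
                b: (counts[prev][b] + pseudocount) / total for b in ALPHABET
            }
        conditionals.append(table)
    return first, conditionals


def wam_log_prob(wam, s):
    """log P(S) = log P_1(S_1) + sum_{i>1} log P_i(S_i | S_{i-1})."""
    first, conditionals = wam
    log_p = math.log(first[s[0]])
    for i in range(1, len(s)):
        log_p += math.log(conditionals[i - 1][s[i - 1]][s[i]])
    return log_p


# Hypothetical aligned example sites, for illustration only.
example_sites = ["GTAAGT", "GTGAGT", "GTAAGA"]
pssm = train_pssm(example_sites)
wam = train_wam(example_sites)
```

The only difference between the two models is that the WAM conditions each position on its left neighbor, so it can capture dependencies between adjacent positions that a PSSM treats as independent.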

16
HMM-based Gene Finding
  • GENSCAN (Burge 1997)
  • FGENESH (Solovyev 1997)
  • HMMgene (Krogh 1997)
  • GENIE (Kulp 1996)
  • GENMARK (Borodovsky & McIninch 1993)
  • VEIL (Henderson, Salzberg, Fasman 1997)

17
GenomeScan
  • Idea: we can enhance gene prediction by using
    external information. DNA regions with homology
    to known proteins are more likely to be coding
    exons.
  • Combine probabilistic extrinsic information
    (BLAST hits) with a probabilistic model of gene
    structure/composition (GenScan)
  • Focus on the typical case, where homologous but
    not identical proteins are available

18
(No Transcript)
19
(No Transcript)
20
GeneWise (Birney, Amitai)
  • Motivation: use a good database of the protein
    world (PFAM) to help annotate genomic DNA
  • GeneWise algorithm aligns a profile HMM directly
    to the DNA

21
Sample GeneWise Output
22
Developing GeneWise Model
  • Start with a PFAM domain HMM
  • Replace AA emissions with codon emissions
  • Allow for sequencing errors (deletions/insertions)
  • Add a 3-state intron model
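
The step of replacing amino-acid emissions with codon emissions can be sketched as follows. This is a simplified illustration, not GeneWise's implementation: it assumes the standard genetic code and splits each amino acid's probability evenly across its synonymous codons, whereas a real model would more plausibly weight codons by species-specific usage. The function name and the example emission values are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons ordered by bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codon_emissions(aa_emissions):
    """Turn amino-acid emission probabilities into codon emission probabilities.

    Assumption: probability mass is divided uniformly among synonymous codons.
    """
    synonymous = {}
    for codon, aa in CODON_TABLE.items():
        synonymous.setdefault(aa, []).append(codon)

    emissions = {}
    for aa, probability in aa_emissions.items():
        codons = synonymous[aa]
        for codon in codons:
            emissions[codon] = probability / len(codons)
    return emissions


# Hypothetical emission distribution of one match state of a domain HMM.
example = codon_emissions({"M": 0.5, "W": 0.25, "L": 0.25})
```

With this change a match state consumes three DNA bases at a time, which is what allows the profile to be aligned to genomic sequence directly rather than to a translated protein.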

23
GeneWise Model
24
GeneWise Intron Model
5' site
3' site
25
GeneWise Model
  • Viterbi algorithm -> best alignment of DNA to
    protein domain
  • Alignment gives exact exon-intron boundaries
  • Parameters learned from species-specific
    statistics
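
For readers unfamiliar with Viterbi decoding, here is a minimal sketch on a toy two-state HMM. It is not the GeneWise model: the states ("coding", "noncoding") and all probabilities are invented for illustration, and the real model has many more states and codon-level emissions.

```python
import math


def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return the most probable state path for obs and its log-probability."""
    # best[t][s]: log-probability of the best path ending in state s at t.
    best = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = []
    for symbol in obs[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda r: best[-1][r] + log_trans[r][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][symbol]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1], best[-1][last]


def log_table(table):
    """Convert a nested dict of probabilities to log-probabilities."""
    return {k: {x: math.log(p) for x, p in row.items()} for k, row in table.items()}


# Hypothetical toy parameters: "coding" is G/C rich, "noncoding" is A/T rich.
STATES = ("coding", "noncoding")
LOG_START = {"coding": math.log(0.5), "noncoding": math.log(0.5)}
LOG_TRANS = log_table({
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
})
LOG_EMIT = log_table({
    "coding": {"A": 0.05, "C": 0.45, "G": 0.45, "T": 0.05},
    "noncoding": {"A": 0.45, "C": 0.05, "G": 0.05, "T": 0.45},
})

path, log_p = viterbi("AAAAGGGGGAAAA", STATES, LOG_START, LOG_TRANS, LOG_EMIT)
```

The state boundaries in the returned path play the role that exon-intron boundaries play in the real alignment: they fall where switching states becomes cheaper than explaining the observed bases with the current state.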

26
GeneWise problems
  • Only provides partial prediction, and only where
    the homology lies
  • Does not find more genes
  • Pseudogenes and retrotransposons are picked up
  • CPU intensive
  • Solution: pre-filter with BLAST

27
Summary
  • Genes are complex structures which are difficult
    to predict with the required level of
    accuracy/confidence
  • Different approaches to gene finding
    • Ab initio: GenScan
    • Ab initio modified by BLAST homologies:
      GenomeScan
    • Homology-guided: GeneWise