Gene Prediction and Genome Annotation - PowerPoint PPT Presentation

Provided by: uwe91
Title: Gene Prediction and Genome Annotation


1
Gene Prediction and Genome Annotation
The Genome Access Course February, 2003
2
How can we get from here
3
to here,
4
here,
5
and here?
6
Without
resorting
to this
7
What is a gene?
  • Stretch of DNA that contains the information for
    the building of protein(s)
  • Dynamic concept, consider
  • Prokaryotic vs. eukaryotic gene models
  • Introns/exons
  • Posttranscriptional modifications
  • Alternative splicing
  • Differential expression
  • Genes-in-genes
  • Genes-ad-genes
  • Posttranslational modifications
  • Multi-subunit proteins

8
Prokaryotic gene model
  • Small genomes, high gene density
  • Haemophilus influenzae genome: 85% genic
  • Operons
  • One transcript, many genes
  • No introns.
  • One gene, one protein
  • Open reading frames
  • One ORF per gene
  • ORFs begin with a start codon and end with a stop codon (by definition)

TIGR: http://www.tigr.org/tigr-scripts/CMR2/CMRGenomes.spl
NCBI: http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/micr.html
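
The ORF definition above (start codon, then the first in-frame stop codon) can be sketched in a few lines. This is a minimal illustration, not the NCBI ORF finder: it only treats ATG as a start codon, scans a single strand, and the `min_codons` threshold is an assumption made for the example.

```python
# Minimal single-strand ORF scan: ATG up to the first in-frame stop codon.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return sorted (start, end) pairs of ORFs on the given strand.

    start is the index of the A in ATG; end is one past the stop codon.
    min_codons (illustrative) counts codons from ATG up to, but not
    including, the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a new ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # close it at the first stop
    return sorted(orfs)


def reverse_complement(seq):
    """Reverse complement, so the opposite strand can be scanned too."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]
```

Calling `find_orfs(reverse_complement(seq))` covers the other three reading frames; coordinates then refer to the reverse-complemented sequence.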
9
Eukaryotic gene model
  • Large genomes, low gene density
  • Homo sapiens genome: 25% genic
  • Posttranscriptional modification
  • 5' cap, poly(A) tail, splicing
  • Open reading frames
  • One ORF per exon; not all contain start/stop codons
  • Mature mRNA contains an ORF by definition
  • Multiple translates
  • One gene, many proteins via alternative splicing

10
Where do genes live?
  • In genomes
  • Example human genome
  • Ca. 3,200,000,000 base pairs
  • 25 chromosomes: 1-22, X, Y, mt
  • 28,000-45,000 genes (current estimate)
  • Gene length: 128 nucleotides (RNA gene) to 2,800 kb (DMD)
  • Ca. 25% of genome are genes (introns, exons)
  • Ca. 1% of genome codes for amino acids (CDS)
  • 30 kb gene length (average)
  • 1.4 kb ORF length (average)
  • 3 transcripts per gene (average)
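
As a rough consistency check, the genic and coding fractions can be recomputed from the other numbers on this slide. The sketch below assumes the lower gene-count estimate (28,000) and treats the average gene and ORF lengths as exact, which they are not; the constant and function names are made up for the example.

```python
# Back-of-the-envelope check of the genome fractions quoted on the slide.
GENOME_BP = 3_200_000_000   # ca. 3.2 Gb
GENE_COUNT_LOW = 28_000     # lower end of the slide's estimate
AVG_GENE_BP = 30_000        # 30 kb average gene length
AVG_ORF_BP = 1_400          # 1.4 kb average ORF length


def genome_fraction(count, avg_len_bp, genome_bp=GENOME_BP):
    """Fraction of the genome covered by `count` features of average length."""
    return count * avg_len_bp / genome_bp


genic_fraction = genome_fraction(GENE_COUNT_LOW, AVG_GENE_BP)   # about 0.26
coding_fraction = genome_fraction(GENE_COUNT_LOW, AVG_ORF_BP)   # about 0.012
```

Both values land close to the "ca. 25%" and "ca. 1%" figures; with the upper estimate of 45,000 genes the genic fraction would come out noticeably higher.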

11
Sample genomes
List of 68 eukaryotes, 141 bacteria, and 17 archaea at
http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/links2a.html
12
The value of genome sequences lies in their
annotation
  • Annotation: Characterizing genomic features
    using computational and experimental methods
  • Genes: Four levels of annotation
  • Gene Prediction: Where are genes?
  • What do they look like?
  • Domains: What do the proteins do?
  • Role: What pathway(s) are they involved in?

13
Genomic sequence features I
  • Repeats
  • Transposable elements, simple repeats
  • RepeatMasker
  • Uses the Smith-Waterman algorithm to align sequences
    to known repeats
  • Advantage: avoids spurious matches to repetitive
    elements
  • Disadvantage: masks sequences that gene
    prediction programs may need for statistics
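
To make the Smith-Waterman step concrete, here is a minimal score-only sketch of local alignment. It is not RepeatMasker's code; the scoring values (match 2, mismatch -1, linear gap -2) are arbitrary assumptions for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)          # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

A masking tool would compare such scores against a threshold for every entry in a repeat library and replace high-scoring regions with N (or lowercase) before gene prediction.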

14
Genomic sequence features II
  • Non-coding RNAs (ncRNA)
  • tRNA: tRNAscan-SE
  • Identifies candidate tRNAs by scanning for pol III
    promoter elements, then confirms them with a
    covariance model.
  • rRNA, snRNA, miRNA, etc.: COVE
  • Identified by performing similarity searches and
    RNA structure analysis using covariance models.

15
Genomic sequence features III
  • Genes
  • Vary in density, length, structure, number of
    introns, number of splice forms, etc.
  • Identification method depends on evidence,
    expertise and methods available.
  • Gene identification usually requires concerted
    application of bioinformatics methods and wet
    experimentation.
  • Pseudogenes
  • Look-alikes of genes that are not transcribed. They
    obstruct gene finding efforts.

16
Gene identification
  • Homology-based gene prediction
  • Similarity Searches (e.g. BLAST, BLAT)
  • Genome Browsers
  • RNA evidence (ESTs)
  • Ab initio gene prediction
  • Gene prediction programs
  • Prokaryotes
  • ORF identification
  • Eukaryotes
  • Promoter prediction
  • PolyA-signal prediction
  • Splice site, start/stop-codon predictions

17
Gene prediction through comparative genomics
  • Purifying selection: Regions conserved between
    two genomes are presumably functional, or else they
    would have diverged.
  • If genomes are too close in the phylogenetic
    tree, there may be too much noise.
  • If genomes are too far apart, analogous regions
    may be missed.
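
One simple way to look for conserved regions is a sliding-window identity scan over two sequences that have already been aligned. The sketch below assumes a pairwise alignment with "-" as the gap character; the window size and identity cutoff are illustrative choices, not values from the slides.

```python
def conserved_windows(aln_a, aln_b, window=10, min_identity=0.8):
    """Return start positions of alignment windows with high identity.

    aln_a and aln_b are two rows of a pairwise alignment (equal length,
    '-' for gaps). Gap columns never count as matches.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    hits = []
    for start in range(len(aln_a) - window + 1):
        pairs = zip(aln_a[start:start + window], aln_b[start:start + window])
        same = sum(1 for x, y in pairs if x == y and x != "-")
        if same / window >= min_identity:
            hits.append(start)
    return hits
```

With closely related genomes almost every window passes the cutoff (the "noise" case above); with distant genomes few or none do.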

18
Genome Browsers
NCBI Map Viewer: www.ncbi.nlm.nih.gov/mapview/
Generic Genome Browser (CSHL): www.wormbase.org/db/seq/gbrowse
Ensembl Genome Browser: www.ensembl.org/
UCSC Genome Browser: genome.ucsc.edu/cgi-bin/hgGateway?org=human
Apollo Genome Browser: www.bdgp.org/annot/apollo/
19
Gene discovery using ESTs
  • Expressed Sequence Tags (ESTs) represent
    sequences from expressed genes.
  • If a genomic region matches an EST with high stringency,
    the region is probably a gene or pseudogene.
  • An EST overlapping an exon boundary gives an accurate
    prediction of that boundary.

20
Tools for EST analysis
  • BLAST (Basic Local Alignment Search Tool)
  • Smaller exons will be missed due to smaller
    score.
  • SIM4 (http://pbil.univ-lyon1.fr/sim4.html)
  • Useful tool to map a gene to genomic sequence
  • Allows for large gaps (introns)

21
Some limitations of ESTs
  • ESTs are usually not full length, which makes it
    hard to identify the complete gene.
  • Genes with low levels of expression or expression
    limited to certain conditions may not be
    represented in the EST library.
  • Smaller exons will still be missed because match
    is not significant enough.
  • Alternative splice forms may obstruct
    identification of exon extents.

22
Ab initio gene prediction
  • Prokaryotes
  • ORF-Detectors
  • Eukaryotes
  • Position, extent, and direction through promoter
    and polyA-signal predictors
  • Structure through splice site predictors
  • Exact location of coding sequences through
    determination of relationships between potential
    start codons, splice sites, ORFs, and stop codons

23
Tools
  • ORF detectors
  • NCBI: http://www.ncbi.nih.gov/gorf/gorf.html
  • Promoter predictors
  • CSHL: http://rulai.cshl.org/software/index1.htm
  • BDGP: fruitfly.org/seq_tools/promoter.html
  • ICG: TATA-Box predictor
  • PolyA signal predictors
  • CSHL: argon.cshl.org/tabaska/polyadq_form.html
  • Splice site predictors
  • BDGP: http://www.fruitfly.org/seq_tools/splice.html
  • Start-/stop-codon identifiers
  • DNALC: Translator/ORF-Finder
  • BCM: Searchlauncher

24
How it works I: Motif identification
  • Exon-intron borders: splice sites

Exon        Intron (splice sites in capitals)   Exon
gaggcatcag  GTttgtagactgtgtttcAG                tgcacccact
ccgccgctga  GTgagccgtgtc tattctAG               gacgcgcggg
tgtgaattag  GTaagaggtt atatctccAG               atggagatca
ccatgaggag  GTgagtg ccattatttccAG               gtatgagacg
Splice sites: GT at the start of each intron, AG at its end.
Motif Extraction Programs at http://www-btls.jst.go.jp/
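
The GT...AG pattern in the splice-site example can be searched for directly. The sketch below is a naive scan that lists every GT and AG dinucleotide and pairs them into candidate introns; the function names and the `min_len` cutoff are assumptions for illustration, and real splice-site predictors score the surrounding sequence context instead of relying on the dinucleotides alone.

```python
import re


def candidate_splice_sites(seq):
    """Return (donors, acceptors): start positions of every GT and AG."""
    seq = seq.upper()
    donors = [m.start() for m in re.finditer("(?=GT)", seq)]
    acceptors = [m.start() for m in re.finditer("(?=AG)", seq)]
    return donors, acceptors


def candidate_introns(seq, min_len=4):
    """Pair each GT with each downstream AG; return (start, end) slices."""
    donors, acceptors = candidate_splice_sites(seq)
    return [(d, a + 2) for d in donors for a in acceptors
            if a + 2 - d >= min_len]
```

On the first example row (exon, intron, and exon joined into one 40-base string) this finds the true intron but also several spurious GT/AG pairs, which is why motif matching by itself over-predicts.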
25
How it works II - Movies
Pribnow-Box Finder 0/1 Pribnow-Box Finder all
26
How it works III: The (ugly) truth
27
Gene prediction programs
  • Rule-based programs
  • Use explicit set of rules to make decisions.
  • Example: GeneFinder
  • Neural Network-based programs
  • Use data set to build rules.
  • Examples: Grail, GrailEXP
  • Hidden Markov Model-based programs
  • Use probabilities of states and transitions
    between these states to predict features.
  • Examples: Genscan, GenomeScan

28
Rule Based - GeneFinder
  • Compares expected vs. observed frequencies to
    score features such as codon bias via Log
    Likelihood Ratios (LLR).
  • Each position of a sequence is scored with respect
    to its potential of being a splice site or
    translational start site.
  • LLR scores and ORF identification determine
    maximum-length coding segments.
  • Total score for a gene is the sum of exon scores
    minus the gap penalty.
  • Performs rather poorly on first and last exons.
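
The log-likelihood ratio idea can be sketched as follows. This is not GeneFinder's implementation: the frequency tables are supplied by the caller, base-2 logarithms are an arbitrary choice, and charging one gap penalty per gap between exons is an assumption, since the slide does not say how the penalty is applied.

```python
import math


def llr_score(codons, coding_freq, background_freq):
    """Sum of log2(P(codon | coding) / P(codon | background)).

    Positive totals mean the codon usage looks more like coding sequence
    than like background.
    """
    return sum(math.log2(coding_freq[c] / background_freq[c]) for c in codons)


def gene_score(exon_scores, n_gaps, gap_penalty):
    """Total gene score: sum of exon scores minus a penalty per gap
    (the per-gap charging scheme is an assumption)."""
    return sum(exon_scores) - n_gaps * gap_penalty
```

For example, with a toy table in which AAA is twice as frequent in coding sequence as in background, each AAA codon contributes +1 to the score.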

29
Neural Networks - Grail, GrailEXP
  • Utilizes sensors trained on a set of known genes
    of the organism.
  • Sensors examine
  • Frame Bias Matrix - Uses codon bias to determine
    ORFs.
  • Fickett Pentamer position weight matrices.
  • Dinucleotide Fractal Dimensions - Transition of
    sequential dinucleotides is represented as
    fractal dimension. CDS differ from nCDS.
  • GrailEXP incorporates a similarity-based method by
    adding a BLASTN component to its prediction
    algorithm. Runs reliably on unmasked sequences.

30
HMM: Genscan, GenomeScan
  • Genscan uses known transcriptional and
    translational signals and then uses an HMM to model
    coding and non-coding regions.
  • GenomeScan incorporates a similarity-based method
    by adding a BLASTX component to its prediction
    algorithm, using the translated sequence to
    search a protein database.
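
How an HMM separates coding from non-coding sequence can be illustrated with a two-state toy model decoded by the Viterbi algorithm. This is far simpler than Genscan's actual model, and every probability below is a hypothetical value chosen only so that the example behaves visibly (GC-rich "coding", AT-rich "noncoding").

```python
import math

STATES = ("coding", "noncoding")

# Hypothetical parameters, for illustration only.
START_P = {"coding": 0.5, "noncoding": 0.5}
TRANS_P = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
EMIT_P = {
    "coding": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "noncoding": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}


def viterbi(seq, start_p=START_P, trans_p=TRANS_P, emit_p=EMIT_P):
    """Most probable state path for a non-empty seq (log-space Viterbi)."""
    v = [{s: math.log(start_p[s]) + math.log(emit_p[s][seq[0]])
          for s in STATES}]
    back = []
    for ch in seq[1:]:
        col, ptr = {}, {}
        for s in STATES:
            # Best predecessor state for arriving in state s.
            prev = max(STATES, key=lambda p: v[-1][p] + math.log(trans_p[p][s]))
            col[s] = (v[-1][prev] + math.log(trans_p[prev][s])
                      + math.log(emit_p[s][ch]))
            ptr[s] = prev
        v.append(col)
        back.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: v[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]
```

For the input "GCGCGCATATAT" the decoded path switches from coding to noncoding exactly at the GC/AT boundary; a real gene finder uses many more states (exons by phase, introns, signals) and trained parameters.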

31
Burge, C. and S. Karlin, Prediction of complete
gene structures in human genomic DNA. J Mol Biol,
1997. 268(1): p. 78-94.
32
Evaluating prediction programs
  • Sensitivity vs. Specificity
  • Sensitivity
  • How many genes were found out of all present?
  • Sn = TP / (TP + FN)
  • Specificity
  • How many predicted genes are indeed genes?
  • Sp = TP / (TP + FP)
  • Programs that combine statistical evaluations
    with similarity searches are most powerful.
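
The two measures can be computed directly from a set of predicted features and a set of annotated ones. In this sketch a feature is any hashable item, for example a (start, end) tuple that must match exactly; that exact-match criterion and the function name are assumptions for the example.

```python
def evaluate(predicted, actual):
    """Return (Sn, Sp) for two sets of features.

    Sn = TP / (TP + FN), Sp = TP / (TP + FP), as defined on the slide.
    """
    tp = len(predicted & actual)    # predicted and real
    fp = len(predicted - actual)    # predicted but not real
    fn = len(actual - predicted)    # real but missed
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    return sn, sp
```

Note that Sp as defined here, TP / (TP + FP), is the quantity called precision or positive predictive value in other fields.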

33
Gene prediction accuracies
  • Nucleotide level: 95% Sn, 90% Sp (lows less than
    50%)
  • Exon level: 75% Sn, 68% Sp (lows less than 30%)
  • Gene level: 40% Sn, 30% Sp (lows less than 10%)
  • Selected readings
  • Parra et al. (2003). Comparative Gene Prediction
    in Human and Mouse. Genome Research 13:108-117.
  • Rogic et al. (2001). Evaluation of Gene-Finding
    Programs in Mammalian Sequences. Genome Research
    11:817-832.
  • Guigo et al. (2000). An Assessment of Gene
    Prediction Accuracy in Large DNA Sequences.
    Genome Research 10:1631-1642.
  • Reese et al. (2000). Genome Annotation Assessment
    in Drosophila melanogaster. Genome Research
    10:483-501.
  • Burge and Karlin (1997). Prediction of Complete
    Gene Structures in Human Genomic DNA, Tab. 1. JMB
    268:78-94.

34
Common difficulties
  • First and last exons difficult to annotate
    because they contain UTRs.
  • Smaller genes are not statistically significant
    so they are thrown out.
  • Algorithms are trained with sequences from known
    genes which biases them against genes about which
    nothing is known.
  • Masking repeats frequently removes chunks from
    the untranslated regions of genes that contain
    repetitive elements.

35
The annotation pipeline
  • Mask repeats using RepeatMasker.
  • Run sequence through several programs.
  • Take predicted genes and do similarity search
    against ESTs and genes from other organisms.
  • Do similarity search for non-coding sequences to
    find ncRNA.

36
Annotation nomenclature
  • Known Gene: Predicted gene matches the entire
    length of a known gene.
  • Putative Gene: Predicted gene contains a region
    conserved with a known gene. Also referred to as
    "like" or "similar to".
  • Unknown Gene: Predicted gene matches a gene or
    EST of which the function is not known.
  • Hypothetical Gene: Predicted gene that does not
    contain significant similarity to any known gene
    or EST.
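
The four categories amount to a small decision rule. The sketch below encodes it with three boolean inputs whose names are invented for the example; how each condition is established (alignment coverage, E-value cutoffs, and so on) is left open, as in the slides.

```python
def classify_prediction(full_length_known_match,
                        partial_known_match,
                        uncharacterized_match):
    """Map similarity evidence for a predicted gene to a nomenclature label.

    full_length_known_match: matches the entire length of a known gene
    partial_known_match:     contains a region conserved with a known gene
    uncharacterized_match:   matches a gene or EST of unknown function
    """
    if full_length_known_match:
        return "known"
    if partial_known_match:
        return "putative"
    if uncharacterized_match:
        return "unknown"
    return "hypothetical"
```

The order of the checks matters: the strongest evidence is tested first, so a prediction with both a full-length and a partial match is labelled "known".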

37
CSHL Generic Genome Browser
http://www.wormbase.org/db/seq/gbrowse
38
NCBI Map Viewer
www.ncbi.nlm.nih.gov/mapview/static/MVstart.html
39
Ensembl Genome Browser
www.ensembl.org/
40
UCSC Genome Browser
genome.ucsc.edu/cgi-bin/hgGateway?org=human
41
Apollo Genome Browser
http://www.bdgp.org/annot/apollo/