Gene Prediction approaches

1
Gene Prediction approaches
Talk By Joy Scaria
2
The value of genome sequences lies in their
annotation
  • Annotation: Characterizing genomic features
    using computational and experimental methods
  • Genes: Four levels of annotation
  • Gene Prediction: Where are genes?
  • What do they look like?
  • Domains: What do the proteins do?
  • Role: What pathway(s) are they involved in?

3
How many genes?
  • Consortium: 35,000 genes?
  • Celera: 30,000 genes?
  • Affymetrix: 60,000 human genes on GeneChips?
  • Incyte and HGS: over 120,000 genes?
  • GenBank: 49,000 unique gene coding sequences?
  • UniGene: > 89,000 clusters of unique ESTs?

4
Current consensus (in flux)
  • 15,000 known genes (similarity to previously
    isolated genes and expressed sequences from a
    large variety of different organisms)
  • 17,000 predicted (GenScan, GeneFinder, GRAIL)
  • Based on and limited to previous knowledge

5
How do we get from here
6
to here,
7
What are genes? - 1
  • Complete DNA segments responsible for making
    functional products
  • Products
    • Proteins
    • Functional RNA molecules
      • RNAi (interfering RNA)
      • rRNA (ribosomal RNA)
      • snRNA (small nuclear)
      • snoRNA (small nucleolar)
      • tRNA (transfer RNA)

8
What are genes? - 2
  • Definition vs. dynamic concept
  • Consider
    • Prokaryotic vs. eukaryotic gene models
    • Introns/exons
    • Posttranscriptional modifications
    • Alternative splicing
    • Differential expression
    • Multi-subunit proteins

9
Prokaryotic gene model: ORF-genes
  • Small genomes, high gene density
    • Haemophilus influenzae genome: 85% genic
  • Operons
    • One transcript, many genes
  • No introns
    • One gene, one protein
  • Open reading frames
    • One ORF per gene
    • ORFs begin with a start codon and end with a
      stop codon (def.)

TIGR: http://www.tigr.org/tigr-scripts/CMR2/CMRGenomes.spl
NCBI: http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/micr.html
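
The "one ORF per gene" rule above can be sketched as a small ORF detector. This is a minimal illustration, not any of the tools named in the talk. It assumes the standard start codon ATG and the stop codons TAA, TAG and TGA, and the names `find_orfs`, `reverse_complement` and the `min_codons` cutoff are hypothetical.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}   # assumed: standard genetic code
START_CODON = "ATG"                   # assumed: canonical start only


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq: str, min_codons: int = 3) -> List[Tuple[int, int, int]]:
    """Scan the three forward reading frames for start...stop stretches.

    Returns (frame, start, end) tuples, where start is the index of the
    A in ATG and end is the index just past the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i                      # open an ORF at the first ATG
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((frame, start, i + 3))
                start = None                   # close it at the first in-frame stop
    return orfs


if __name__ == "__main__":
    print(find_orfs("CCATGAAATTTTAGGG"))
```

To cover both strands, the same function can be run on `reverse_complement(seq)`; coordinates then refer to the reversed sequence.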
10
Promoter
11
Eukaryotic gene model: spliced genes
  • Posttranscriptional modification
    • 5'-cap, polyA tail, splicing
  • Open reading frames
    • Mature mRNA contains ORF
    • All internal exons contain open read-through
    • Pre-start and post-stop sequences are UTRs
  • Multiple translates
    • One gene, many proteins via alternative splicing

12
Expansions and Clarifications
  • ORFs
    • Start - triplets - stop
    • Prokaryotes: gene = ORF
    • Eukaryotes: spliced genes or ORF genes
  • Exons
    • Remain after introns have been removed
    • Flanking parts contain non-coding sequence
      (5'- and 3'-UTRs)

13
Where do genes live?
  • In genomes
  • Example: human genome
    • 3,200,000,000 base pairs
    • Chromosomes 1-22, X, Y, mt
    • 28,000-45,000 genes (current estimate)
    • 25% of the genome is genes (introns, exons)
    • 1% of the genome codes for amino acids (CDS)
    • 30 kb gene length (average)
    • 1.4 kb ORF length (average)
    • 3 transcripts per gene (average)
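
As a rough consistency check, the averages on this slide can be multiplied out and compared with the genic and coding fractions it quotes (read here as percentages). This is only back-of-the-envelope arithmetic on the slide's own numbers; the function names are illustrative.

```python
GENOME_BP = 3_200_000_000          # genome size from the slide
GENE_COUNT_RANGE = (28_000, 45_000)  # gene-count estimate from the slide
AVG_GENE_BP = 30_000               # 30 kb average gene length
AVG_ORF_BP = 1_400                 # 1.4 kb average ORF length


def total_span(count: int, avg_len_bp: int) -> int:
    """Total base pairs covered by `count` features of average length."""
    return count * avg_len_bp


def percent_of_genome(bp: int) -> float:
    """Express a base-pair total as a percentage of the genome."""
    return 100 * bp / GENOME_BP


if __name__ == "__main__":
    for count in GENE_COUNT_RANGE:
        genic = percent_of_genome(total_span(count, AVG_GENE_BP))
        coding = percent_of_genome(total_span(count, AVG_ORF_BP))
        print(f"{count} genes: genic {genic:.2f}%, coding {coding:.2f}%")
```

With these inputs the genic share comes out at roughly 26% to 42% and the coding share at about 1.2% to 2.0%, the same order of magnitude as the slide's figures. The spread reflects how approximate the averages and the gene count are.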

14
So much DNA, so few genes
15
Genomic sequence features
  • Repeats ("junk DNA")
    • Transposable elements, simple repeats
    • RepeatMasker
  • Genes
    • Vary in density, length, structure
    • Identification depends on evidence and methods
      and may require concerted application of
      bioinformatics methods and lab research
  • Pseudogenes
    • Look-alikes of genes; obstruct gene-finding
      efforts

16
Gene identification
  • Homology-based gene prediction
    • Similarity searches (e.g. BLAST, BLAT)
    • Genome browsers
    • RNA evidence (ESTs)
  • Ab initio gene prediction
    • Gene prediction programs
    • Prokaryotes
      • ORF identification
    • Eukaryotes
      • Promoter prediction
      • PolyA-signal prediction
      • Splice site, start/stop-codon predictions

17
Gene prediction through comparative genomics
  • Highly similar (conserved) regions between two
    genomes are likely functional; otherwise they
    would have diverged
  • If genomes are too closely related, all regions
    are similar, not just genes
  • If genomes are too far apart, analogous regions
    may be too dissimilar to be found
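
One simple way to picture the search for conserved regions is a sliding-window percent identity over two sequences that have already been aligned. The sketch below assumes the alignment is given as two equal-length strings with "-" for gaps. The window size, the 0.9 threshold and the function names are illustrative choices, not values from the talk.

```python
from typing import List


def window_identity(aln_a: str, aln_b: str, window: int = 10) -> List[float]:
    """Fraction of identical, non-gap columns in each sliding window."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    matches = [
        x == y and x != "-"
        for x, y in zip(aln_a.upper(), aln_b.upper())
    ]
    return [
        sum(matches[i:i + window]) / window
        for i in range(len(matches) - window + 1)
    ]


def conserved_windows(aln_a: str, aln_b: str,
                      window: int = 10, threshold: float = 0.9) -> List[int]:
    """Start positions of windows whose identity reaches the threshold."""
    scores = window_identity(aln_a, aln_b, window)
    return [i for i, score in enumerate(scores) if score >= threshold]


if __name__ == "__main__":
    a = "ACGTACGTACAAAAAAAAAA"
    b = "ACGTACGTACCCCCCCCCCC"
    print(conserved_windows(a, b))
```

If the two genomes are very closely related, nearly every window passes the threshold; if they are very distant, almost none do. Those are the two failure modes listed above.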

18
Genome Browsers
NCBI Map Viewer: www.ncbi.nlm.nih.gov/mapview/
Generic Genome Browser (CSHL): www.wormbase.org/db/seq/gbrowse
Ensembl Genome Browser: www.ensembl.org/
UCSC Genome Browser: genome.ucsc.edu/cgi-bin/hgGateway?org=human
Apollo Genome Browser: www.bdgp.org/annot/apollo/
19
Gene discovery using ESTs
  • Expressed Sequence Tags (ESTs) represent
    sequences from expressed genes.
  • If a region matches an EST with high stringency,
    the region is probably a gene or pseudogene.
  • An EST that overlaps an exon boundary gives an
    accurate prediction of that boundary.

20
Ab initio gene prediction
  • Prokaryotes
    • ORF detectors
  • Eukaryotes
    • Position, extent, and direction: through promoter
      and polyA-signal predictors
    • Structure: through splice site predictors
    • Exact location of coding sequences: through
      determination of relationships between potential
      start codons, splice sites, ORFs, and stop codons

21
Tools
  • ORF detectors
    • NCBI: http://www.ncbi.nih.gov/gorf/gorf.html
  • Promoter predictors
    • CSHL: http://rulai.cshl.org/software/index1.htm
    • BDGP: fruitfly.org/seq_tools/promoter.html
    • ICG: TATA-Box predictor
  • PolyA signal predictors
    • CSHL: argon.cshl.org/tabaska/polyadq_form.html
  • Splice site predictors
    • BDGP: http://www.fruitfly.org/seq_tools/splice.html
  • Start-/stop-codon identifiers
    • DNALC: Translator/ORF-Finder
    • BCM: Searchlauncher

22
How it works I: Motif identification
  • Exon-intron borders: splice sites

Exon        Intron                Exon
gaggcatcag  gtttgtagactgtgtttcag  tgcacccact
ccgccgctga  gtgagccgtgtctattctag  gacgcgcggg
tgtgaattag  gtaagaggttatatctccag  atggagatca
ccatgaggag  gtgagtgccattatttccag  gtatgagacg

The same sequences with the splice sites capitalized:

Exon        Intron                Exon
gaggcatcag  GTttgtagactgtgtttcAG  tgcacccact
ccgccgctga  GTgagccgtgtctattctAG  gacgcgcggg
tgtgaattag  GTaagaggttatatctccAG  atggagatca
ccatgaggag  GTgagtgccattatttccAG  gtatgagacg

Motif extraction programs at http://www-btls.jst.go.jp/
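
The motif idea on this slide can be sketched by counting bases position by position across the four introns shown and reading off a consensus for the intron start (donor) and intron end (acceptor). This is a toy illustration on four sequences, not one of the motif extraction programs referenced above. The six-base site width and all function names are assumptions.

```python
from collections import Counter
from typing import List, Tuple

# The four introns from the slide, upper-cased.
INTRONS = [
    "GTTTGTAGACTGTGTTTCAG",
    "GTGAGCCGTGTCTATTCTAG",
    "GTAAGAGGTTATATCTCCAG",
    "GTGAGTGCCATTATTTCCAG",
]


def position_counts(sites: List[str]) -> List[Counter]:
    """Count the bases seen at each position of equal-length sites."""
    return [Counter(site[i] for site in sites) for i in range(len(sites[0]))]


def consensus(sites: List[str]) -> str:
    """Most frequent base at each position."""
    return "".join(c.most_common(1)[0][0] for c in position_counts(sites))


def candidate_introns(seq: str, min_len: int = 20) -> List[Tuple[int, int]]:
    """All (start, end) spans that begin with GT and end with AG."""
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [j for j in range(len(seq) - 1) if seq[j:j + 2] == "AG"]
    return [(i, j + 2) for i in donors for j in acceptors if j + 2 - i >= min_len]


if __name__ == "__main__":
    print(consensus([s[:6] for s in INTRONS]))    # intron start (donor side)
    print(consensus([s[-6:] for s in INTRONS]))   # intron end (acceptor side)
    row1 = "GAGGCATCAG" + INTRONS[0] + "TGCACCCACT"
    print(candidate_introns(row1))
```

Scanning for the GT and AG dinucleotides alone yields many spurious spans, which is why real predictors score the wider context around each site.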
23
Gene prediction programs
  • Rule-based programs
    • Use explicit set of rules to make decisions.
    • Example: GeneFinder
  • Neural network-based programs
    • Use data set to build rules.
    • Examples: Grail, GrailEXP
  • Hidden Markov Model-based programs
    • Use probabilities of states and transitions
      between these states to predict features.
    • Examples: Genscan, GenomeScan
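
To make "probabilities of states and transitions" concrete, here is a toy two-state hidden Markov model decoded with the Viterbi algorithm. It is far simpler than Genscan or GenomeScan and is not their model: the two states, the GC-biased emission probabilities and the transition probabilities are all invented for illustration.

```python
import math
from typing import List

STATES = ("noncoding", "coding")

# All probabilities below are made-up illustration values.
START = {"noncoding": 0.5, "coding": 0.5}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
}


def viterbi(seq: str) -> List[str]:
    """Most probable state path for seq, computed in log space."""
    seq = seq.upper()
    if not seq:
        return []
    # Log-score of the best path ending in each state at position 0.
    score = [{s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}]
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for base in seq[1:]:
        prev = score[-1]
        col, ptr = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            col[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][base])
            ptr[s] = best
        score.append(col)
        back.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]


if __name__ == "__main__":
    print(viterbi("ATATATATAT" + "GC" * 10 + "ATATATATAT"))
```

With these toy numbers a GC-rich stretch has to be long enough to pay for two state switches before it is labelled coding. Short genes being missed is the same effect.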

24
Common difficulties
  • First and last exons are difficult to annotate
    because they contain UTRs.
  • Small genes often do not reach statistical
    significance, so they are thrown out.
  • Algorithms are trained on sequences from known
    genes, which biases them against genes about which
    nothing is known.
  • Masking repeats frequently removes potentially
    indicative chunks from the untranslated regions
    of genes that contain repetitive elements.

25
The annotation pipeline
  • Mask repeats using RepeatMasker.
  • Run the masked sequence through several gene
    prediction programs.
  • Take predicted genes and do similarity search
    against ESTs and genes from other organisms.
  • Do similarity search for non-coding sequences to
    find ncRNA.

26
Annotation nomenclature
  • Known Gene: Predicted gene matches the entire
    length of a known gene.
  • Putative Gene: Predicted gene contains a region
    conserved with a known gene. Also referred to as
    "like" or "similar to".
  • Unknown Gene: Predicted gene matches a gene or
    EST of which the function is not known.
  • Hypothetical Gene: Predicted gene that does not
    contain significant similarity to any known gene
    or EST.
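
The four labels above amount to a small decision rule, sketched below. The `BestHit` record, its fields and the `full_length` cutoff are hypothetical. The slide also does not say which label wins when a full-length match is to a gene of unknown function, so checking "Unknown" first is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BestHit:
    """Hypothetical summary of the best similarity-search hit."""
    coverage: float        # fraction of the known gene's length matched (0 to 1)
    function_known: bool   # whether the matched gene or EST has a known function


def annotation_label(hit: Optional[BestHit], full_length: float = 1.0) -> str:
    """Map a predicted gene's best hit to one of the four slide labels."""
    if hit is None:
        return "Hypothetical Gene"   # no significant similarity at all
    if not hit.function_known:
        return "Unknown Gene"        # assumption: checked before coverage
    if hit.coverage >= full_length:
        return "Known Gene"          # matches the entire known gene
    return "Putative Gene"           # only a conserved region matches


if __name__ == "__main__":
    print(annotation_label(BestHit(coverage=0.4, function_known=True)))
```

In practice the "entire length" test would likely use a cutoff slightly below 1.0 to tolerate alignment noise; the default here simply mirrors the slide's wording.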