Gene Prediction - PowerPoint PPT Presentation

Slides: 57
Provided by: blk
Transcript and Presenter's Notes

1
TAGTCAGACAGAAAGGCAGGCACAAAGTACGGTAGAGTCTTCTAGCACTA
AAATCCTATTTGACCTTCTCCTGGGCCTTTTCTTCTAACACAGCCACACT
ACCTTATATAATTCTTGTTGTAAGCAGAAAGTTGGCATGCCATCCAAACA
AAACAACTTCCTTCCAGAGGACAGGTCCATGAGAACTTTCCCACAGATAC
CCATTCACATACATTCAATGTCCTGGACAGGGCTCCTCCTCAGTCTGCCA
CGCAAGAAGAACACACAGGACACAGGGCATACTCTATTTGATTCAACTAG
TGCGTTCCACGGACACTTTCTAACACAGTAGCTCTGGACCTAGAACGCGG
CATCCAGCAGTACACTCTGCTAGATGAAGGGGGAGAAAAGGCATTTTGAA
TACATTCTCTAAAAATCCTGACAGCAAGGCTACAGGTATATCGAAGTATA
ATGGAACAGTCACGAGGCCCCGGGTTTGATCCTCAGTGTGGCTAAGCAAT
GAATCCACATAGCAACTCGGGAATAATTATTTTAGCTTATTATTTTAAAA
CGCCAGCGACTTTATTTTCTTCGCCCAAGCTCAAATTAATTAAAGGTTAT
AAATGGTCACTTCTCCGTAGAAGCCAGAACTCTCCCCCTCTTCAGAGCAG
GGGAATACCTCATAAATAAATTAGGCGAAACCATGGCTTGCTGATTGAAT
GAATGATAATCCACAGTCCATGTGGTTGCCAAGTCTTTCTCTAGACCTCT
CTACCGCAATGAGCAATCCCTGAACGTCAACGAAGAGGCTTACTTCATCA
GTTATCTGGAAGTCTGCGAGTCGTGAAGACAGCCCACAGAAATACTAGCT
TCTCCACTCAGCCTCGATTCACCGGAAGGACCATGAAAAGGAACAGCACC
AGTGAATCTGATGCGGCTCCCTTCCAACTCACTGCAGCTCAGTCAGCCTG
Gene Prediction
2
Gene Prediction
Identify genes from genomic (DNA) sequence
Elucidate gene structure: exons, introns,
promoter
Use gene structure to predict transcripts and
polypeptide (protein) sequences
3
Genes
  • Prokaryotic genes
  • Eukaryotic genes

4
Prokaryotic Genes
  • Small genomes, high gene density
  • Haemophilus influenzae genome: 85% genic
  • Operons
  • One transcript, many genes
  • No introns.
  • One gene, one protein
  • Open reading frames
  • One ORF per gene
  • ORFs begin with a start codon and end with a stop codon
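
The ORF rule above (begin with a start codon, end with a stop codon, no introns) can be sketched in a few lines of Python. This is a minimal illustration, not a production gene finder: it scans only the forward strand, assumes ATG as the only start codon and the standard stop codons TAA, TAG and TGA, and the function name `find_orfs` and the `min_codons` threshold are made up for this example.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end) pairs (0-based, end exclusive) of forward-strand ORFs.

    Each ORF runs from the first ATG seen in a reading frame to the next
    in-frame stop codon, stop codon included.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i  # open a new ORF at the first start codon
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # close the ORF and keep scanning this frame
    return sorted(orfs)


if __name__ == "__main__":
    print(find_orfs("CCATGAAATAGGG"))
```

A real prokaryotic gene finder would also scan the reverse complement, allow alternative start codons, and apply a much larger minimum length.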

5
(No Transcript)
6
Eukaryotic Genes
  • Much lower gene density in genome
  • Gene-rich regions
  • Gene-poor regions
  • Gene desert: a region with no known, novel, or
    partial genes in a 500 kb window
  • Undergo several post-transcriptional
    modifications
  • 5' cap
  • Poly(A) tail
  • Splicing

7
(No Transcript)
8
Genome Sequencing
The Hybrid Approach
9
Recently Sequenced Genomes
10
Whole Genome Shotgun Sequencing Being Applied
Everywhere
Phylogenetic tree of proteorhodopsin-like genes
1.2 million new genes
J. Craig Venter Institute is now sequencing NYC
air
11
Unseen Benefit of Whole Genome Shotgun Sequencing
Wolbachia endosymbionts found in fly embryos
Genome sequences for three new Wolbachia species
reconstructed from sequence traces from 7
Drosophila species
95% (1,440,650 base pairs) for one of the new
species
12
Obtaining Genome Sequence Using SNPs
On-going project to resequence 15 inbred mouse
strains
Inbred strains are homozygous for every gene
Genotype each strain
  • Find SNPs between each strain and the reference strain (C57BL/6J)
  • Use Affymetrix SNP arrays to genotype strains
Infer genome sequence for these 15 strains
http://mouse.perlegen.com
13
Infer Genome Sequences for 15 Strains
SNPs between NOD/LtJ and C57BL/6J
Infer sequence of NOD/LtJ strain by substituting
bases
Do the same for the 14 other strains
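
The substitution step can be sketched as follows. This is only an illustration of the idea: the SNP table format (0-based position mapped to a reference base and a strain base) and the function name `infer_strain_sequence` are assumptions for this example, not the format used by the resequencing project.

```python
def infer_strain_sequence(reference, snps):
    """Return the reference sequence with each SNP's strain base substituted in.

    snps maps a 0-based position to (reference_base, strain_base).
    """
    bases = list(reference)
    for pos, (ref_base, strain_base) in snps.items():
        if bases[pos] != ref_base:
            # Guard against a SNP table built on a different reference build.
            raise ValueError(f"reference mismatch at position {pos}")
        bases[pos] = strain_base
    return "".join(bases)


if __name__ == "__main__":
    c57bl6j = "ACGTACGT"                     # toy stand-in for the reference
    nod_snps = {1: ("C", "T"), 6: ("G", "A")}  # toy SNP calls for one strain
    print(infer_strain_sequence(c57bl6j, nod_snps))
```

Note that this approach can only recover single-base differences at genotyped positions; insertions, deletions and unassayed variants stay invisible.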
14
Miscellaneous Assembly Details: Gaps
BAC-end mate pairs used to make assembly
Assemblies contain gaps to space contigs
  • First 3 Mb of all mouse chromosomes are Ns
  • 100 bp for gaps within contigs
  • Variable-length gaps between contigs to reflect size estimate
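
Because gaps are written as runs of N, they can be located directly in the assembled sequence. A minimal sketch, with `find_gaps` as an illustrative name:

```python
import re


def find_gaps(seq):
    """Return (start, end) pairs (0-based, end exclusive) for every run of Ns."""
    return [(m.start(), m.end()) for m in re.finditer(r"N+", seq.upper())]


if __name__ == "__main__":
    for start, end in find_gaps("ACGTNNNNACGNNA"):
        print(start, end, end - start)  # position and length of each gap
```

Gap lengths from a scan like this could then be compared against the 100 bp convention to tell fixed-size placeholders from size-estimated gaps.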
15
Gene Prediction
16
Gene Prediction Procedure
Obtain genomic sequence
Ensure vector sequences are removed
17
Analysis Pipelines
18
Genome Browsers
UCSC: http://genome.ucsc.edu
  • University of California, Santa Cruz
  • Annotates other gene builds
Ensembl: http://www.ensembl.org
  • EBI and Sanger collaboration
  • Gene build, predicts novel genes
NCBI: http://www.ncbi.nlm.nih.gov/mapview/
  • NCBI Map Viewer
  • Gene build, predicts novel genes
Pay attention to gene nomenclature
Build your own genome browser with GBrowse:
http://www.gmod.org/ggb/
19
Genes Classified By Evidence
Known genes: as catalogued by the reference sequence project
  • Ensembl known genes (red genes)
  • NCBI known genes
Novel genes (1): based on similarity to known genes or cDNAs;
these need not have 100% matching supporting evidence
  • Ensembl novel genes (black)
  • NCBI LOC genes
20
Genes Classified By Evidence
Novel genes (2): based on the presence of ESTs
  • Resource of alternative splicing
  • EST genes in Ensembl (purple)
  • Database of transcribed sequences (DoTS)
  • Acembly
Gene prediction
  • Single organism: Genscan
  • Comparative information: Twinscan
Pseudogenes: match a known gene but with a disrupted ORF
21
Genes Classified By Evidence: Microbial Genomes
  • Classified function
  • Conserved, unknown function
  • Species-specific, unknown function
  • Strain-specific
  • Hypothetical protein
  • Other
TIGR nomenclature rules
22
Known Gene (Nfkb1): Lots of Evidence
23
Supporting Evidence For Genes
mRNA
24
Example of a Novel Gene
25
Gene Resources
26
Resources That Represent Genes
NCBI: Map Viewer, Gene, RefSeq
Ensembl
UCSC Genome Browser
Manual Genome Annotation Projects
27
http://www.ncbi.nlm.nih.gov/MapView
NCBI Map Viewer
28
Gene at NCBI
http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?db=gene
Records contain links to organism databases
29
RefSeq at NCBI
http://www.ncbi.nlm.nih.gov/RefSeq/
30
Genes in Ensembl
http://www.ensembl.org
31
Genes in Ensembl
http://www.ensembl.org/Mus_musculus/geneview?gene=ENSMUSG00000001657
32
Supporting Evidence in Ensembl
DNA
Protein
33
Genes in UCSC Genome Browser
http://genome.ucsc.edu
34
Genes in UCSC Genome Browser
35
Manual Genome Annotation Efforts
VEGA Project at Sanger
VEGA: Vertebrate Genome Annotation project
  • Human: 14 chromosomes (50% of genome)
  • Mouse: Chr. 4 and 11 only
  • Zebrafish: 3546 genes so far
  • Dog: MHC class II region so far (Doberman)
36
Gene Prediction Methods
37
Gene Prediction Programs
  • Compositional Methods
  • Scan for features in sequence using consensus
    sequence
  • ab initio methods
  • Only 50% accurate (1996)
  • Comparative Methods
  • Compare sequence to cDNA sequence databases
  • Compare sequence to EST sequence databases
  • Have to use both methods

38
Gene Prediction Programs
  • Ab initio gene prediction
  • First ones predicted single exons, e.g. GRAIL
    (Uberbacher, 91) or MZEF (Zhang, 97)
  • Later, predict entire genes e.g. Genscan (Burge
    97) and Fgenesh (Solovyev, 95)
  • Predict individual exons based on codon usage and
    sequence signals (start, stop, splice sites)
    followed by assembly of putative exons into genes
  • Genscan predicts 90% of coding nucleotides and
    70% of coding exons (Guigo, 00)
  • Cannot use gene prediction methods alone to
    accurately identify every gene in a genome
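
To illustrate the "sequence signals" part of ab initio prediction, the sketch below lists candidate introns bounded by a GT donor and an AG acceptor dinucleotide. It is a deliberately naive toy: programs such as Genscan score signals with probabilistic models and codon usage rather than exact matching, and the function names and the `min_len` parameter here are invented for the example.

```python
def candidate_splice_sites(seq):
    """Return (donors, acceptors): 0-based positions of every GT and AG."""
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    return donors, acceptors


def candidate_introns(seq, min_len=4):
    """Return (start, end) spans (end exclusive) that start with GT and end with AG."""
    donors, acceptors = candidate_splice_sites(seq)
    return [
        (d, a + 2)
        for d in donors
        for a in acceptors
        if a + 2 - d >= min_len  # acceptor must lie downstream of the donor
    ]


if __name__ == "__main__":
    print(candidate_introns("AAGTCCCAGTT"))
```

On real genomic sequence almost every GT and AG is a false positive, which is one reason signal matching alone cannot identify genes reliably.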

39
Comparing Gene Prediction Programs
  • Sn = Sensitivity = TP/(TP+FN)
  • How many exons were found out of total present?
  • Sp = Specificity = TP/(TP+FP)
  • How many predicted exons were correct out of
    total exons predicted?
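
A small worked example of exon-level Sn = TP/(TP+FN) and Sp = TP/(TP+FP). The sketch assumes an exon counts as a true positive only when both of its boundaries match an annotated exon exactly; the function name and the toy coordinates are made up for illustration.

```python
def exon_accuracy(true_exons, predicted_exons):
    """Return (sensitivity, specificity) at the exon level.

    Exons are (start, end) tuples; a prediction is correct only on an exact match.
    """
    true_set, pred_set = set(true_exons), set(predicted_exons)
    tp = len(true_set & pred_set)   # annotated exons that were predicted
    fn = len(true_set - pred_set)   # annotated exons that were missed
    fp = len(pred_set - true_set)   # predicted exons that are not annotated
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    return sn, sp


if __name__ == "__main__":
    annotated = [(100, 200), (300, 400), (500, 600), (700, 800)]
    predicted = [(100, 200), (300, 400), (510, 600)]
    print(exon_accuracy(annotated, predicted))
```

Here two of four annotated exons are found (Sn = 0.5) and two of three predictions are correct (Sp = 2/3); the exon predicted at 510-600 counts as both a miss and a false prediction because one boundary is off.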

40
Gene-finder Comparisons
Drosophila Adh
A 2.9 Mb region of the Drosophila genome
containing the Adh locus was manually curated
using unpublished cDNA sequences. Six different
gene-finding systems were applied and the results
were compared (Reese et al., 2000, Genome
Research).
Chlamydomonas
Li et al., 2003, analyzed 0.5% of the finished
Chlamydomonas genome. 158 known genes were used
to assess ab initio gene-finders. The program
TAP, which identifies genes from assembled ESTs,
was used as an independent performance measure.
David Kulp
41
Length Distributions for Exons and Introns
Gene prediction programs must be tuned to a
particular species
David Kulp
42
Gene Prediction Programs
TwinScan
Genie
43
Twinscan
44
Twinscan
Gene structure prediction model
  • Extends probability model of GENSCAN
  • Exploits homology between two related genomes
  • Notable improvement on GENSCAN
45
Twinscan
46
Known Gene (Nfkb1): Lots of Evidence
47
Twinscan
http://genes.cs.wustl.edu
48
Twinscan
49
Genie
50
Gene Structures as Grammars
  • Searls (1988) introduced ideas of formal language
    theory in biosequence analysis
  • Context-free grammar: recursive decomposition

David Kulp
51
Models
Gene Model
B = begin position; S = start position; D = donor
site (gt); A = acceptor site (ag); T = termination
site; F = final position
U5 = 5' UTR; U3 = 3' UTR; EI = exon to
intron boundary; SE = single exon; I = intron; E =
exon; FE = final coding exon
States
Transitions
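
One way to read a gene model of this kind is as a graph over the signal symbols B, S, D, A, T and F, where a legal gene structure is a path from B to F. The transition table below is an assumption made for illustration only; the slide's figure is not in the transcript, and Genie's actual model is richer than this sketch.

```python
# Hypothetical transition table: which signal may follow which.
TRANSITIONS = {
    "B": {"S"},       # begin -> start codon (5' UTR in between)
    "S": {"D", "T"},  # start -> donor (first exon) or stop (single-exon gene)
    "D": {"A"},       # donor -> acceptor (an intron in between)
    "A": {"D", "T"},  # acceptor -> donor (internal exon) or stop (final exon)
    "T": {"F"},       # stop -> final position (3' UTR in between)
    "F": set(),
}


def is_valid_parse(signals):
    """Return True if the signal sequence is a B-to-F path in TRANSITIONS."""
    signals = list(signals)
    if not signals or signals[0] != "B" or signals[-1] != "F":
        return False
    return all(
        nxt in TRANSITIONS.get(cur, set())
        for cur, nxt in zip(signals, signals[1:])
    )


if __name__ == "__main__":
    for parse in ("BSTF", "BSDADATF", "BSDTF"):
        print(parse, is_valid_parse(parse))
```

In this toy table a single-exon gene is the path B, S, T, F, and each additional intron adds one D, A pair; a donor followed directly by a stop is rejected because an intron must close with an acceptor.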
52
Models and Graphs
Gene Model
David Kulp
53
Default Genie Gene Model
David Kulp
Genie addresses the problem of stop codons that span
two exons
54
(No Transcript)
55
Other Gene Prediction Programs
  • ORF detectors
  • NCBI http://www.ncbi.nih.gov/gorf/gorf.html
  • Promoter predictors
  • CSHL http://rulai.cshl.org/software/index1.htm
  • BDGP fruitfly.org/seq_tools/promoter.html
  • ICG TATA-Box predictor
  • PolyA signal predictors
  • CSHL http://rulai.cshl.org/tools/polyadq/polyadq_form.html
  • Splice site predictors
  • BDGP http://www.fruitfly.org/seq_tools/splice.html
  • Start-/stop-codon identifiers
  • DNALC Translator/ORF-Finder
  • BCM Searchlauncher
  • Genie (Have to download source code, compile,
    and install to run)

http://brl.cs.umass.edu/Research/GenePredictionWithConstraints
56
Acknowledgements
  • Gareth Howell
  • University of Sheffield, UK
  • David Kulp
  • University of Massachusetts - Amherst