Title: Gene Prediction/Identification
1 Gene Prediction/Identification
2 Prokaryotic Genomes
- Small genomes, high gene density
- Haemophilus influenzae genome: 85% genic
- First cellular genome sequenced, 1995 (Venter et al.), 1.8 Mb
- Operons
- One transcript, many genes
- No introns.
- One gene, one protein
- Open reading frames
- One ORF per gene
- ORFs begin with a start codon and end with a stop codon
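The ORF definition above (start codon, in-frame codons, stop codon) can be sketched as a simple scan. This is a minimal illustration, not a production gene finder: the function name `find_orfs` and the `min_codons` parameter are hypothetical, only the forward strand is scanned, and only ATG is treated as a start codon (prokaryotes also use alternative starts such as GTG and TTG).

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard stop codons
START_CODON = "ATG"                  # assumption: ATG is the only start considered


def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) for each forward-strand ORF.

    start is the 0-based index of the start codon, end is exclusive and
    includes the stop codon, frame is 0, 1 or 2.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon in this reading frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


print(find_orfs("CCATGAAATAGGG"))
```

A fuller version would also scan the reverse complement, so that all six reading frames are covered.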
3 Identification of Functional Units
- Genome sequencing
- Transcript sequencing
- Genes: coding and non-coding
- Others
- Promoters
- Transcription factor binding sites
- Replication origins
- ENCODE project: http://www.genome.gov/10005107
4 Sequencing large genomes
The hybrid approach
5 Genome Assembly
Chromosome project
6 Genomes available
Data reproduced from Genome Maps 10, Science 1999.
Human
Chimpanzee
Fish
Mouse
Rat
Bull
Sheep
Muntjac
Dolphin
Pig
Horse
Cat
Mink
Bear
G. Panda
Gibbon
Macaque
Tree Shrew
Bat
Euro. Shrew
Lemur
- Invertebrates
- Fly, worm, yeast, bacteria
- Plants
7 Gene identification
- When is a gene a gene?
- All genes need some kind of supporting evidence
- Does the region annotated as a gene have
evidence of actually being transcribed?
8 The evidence for a gene
mRNA -> reverse transcription -> cDNA
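At the sequence level, the mRNA-to-cDNA step can be sketched as follows. The function names are hypothetical and the code assumes an mRNA string containing only A, C, G and U. First-strand cDNA is the reverse complement of the mRNA; database cDNA entries are usually reported in the mRNA sense, with T in place of U.

```python
# Complement of each RNA base, written in the DNA alphabet.
COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}


def first_strand_cdna(mrna):
    """Reverse complement of the mRNA, as synthesised by reverse transcriptase."""
    return "".join(COMPLEMENT[base] for base in reversed(mrna.upper()))


def sense_cdna(mrna):
    """The mRNA sequence in the DNA alphabet (U replaced by T)."""
    return mrna.upper().replace("U", "T")


print(first_strand_cdna("AUGGCC"))
print(sense_cdna("AUGGCC"))
```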
9 Sequence submission
- Information is mirrored daily between DDBJ (Japan), GenBank (USA) and EMBL (Europe).
10 Information is mirrored daily between databases
Most comprehensive source of DNA sequences
11 Summary of Available Databases
DNA databases (EMBL/GenBank/DDBJ)
mRNA (cDNA)
Genomic (finished, draft)
dbEST (ESTs)
12 Information hubs: entry points for interrogation
13 Gene Catalogues
14 Ensembl Gene catalogue
Supporting evidence (exon)
Protein alignment
DNA alignment
15 Gene Catalogues
16 Reference Sequence
Goal: one sequence entry for each naturally occurring DNA, RNA and protein molecule
- chromosome: NC_000000
- gene: NG_000000
- mRNA: NM_000000
- protein: NP_000000
- contig: NT_000000
- predicted mRNA: XM_000000
- predicted protein: XP_000000
Key: curated, calculated
Multiple products for one gene are instantiated
as separate RefSeqs with the same GeneID.
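As a small illustration of the accession scheme on this slide, the two-letter prefix can be mapped to a molecule type. This is a sketch only: the function name `classify_refseq` is hypothetical, and the table covers just the prefixes listed above, not the full RefSeq set.

```python
# Prefix-to-molecule mapping, restricted to the prefixes shown on the slide.
REFSEQ_PREFIXES = {
    "NC": "chromosome",
    "NG": "gene",
    "NM": "mRNA",
    "NP": "protein",
    "NT": "contig",
    "XM": "predicted mRNA",
    "XP": "predicted protein",
}


def classify_refseq(accession):
    """Return the molecule type for a RefSeq-style accession, or None."""
    prefix, sep, _ = accession.partition("_")
    if not sep:
        # No underscore: not in the PREFIX_NUMBER form used by RefSeq.
        return None
    return REFSEQ_PREFIXES.get(prefix.upper())


print(classify_refseq("NM_000000"))
print(classify_refseq("XP_000000"))
```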
17 Gene Catalogues
18 General Gene classification
- Known genes: as catalogued by the reference sequence project
- Ensembl known genes (red genes)
- NCBI known genes
- Novel genes (1): based on similarity to known genes, or cDNAs
- these need not have 100% matching supporting evidence
- Ensembl novel genes (black)
- NCBI Loc genes (locus)
19 General Gene classification
- Novel genes (2): based on the presence of ESTs
- resource of alternative splicing
- EST genes in Ensembl (purple)
- Database of transcribed sequences (DOTs) www.allgenes.org
- Acembly (assembly)
- Gene prediction: with NO prior expressed sequence as evidence
- Single organism: Genscan
- Comparative information: Twinscan
- commonalities and differences
- Pseudogenes: match a known gene but with a disrupted ORF
20 Gene prediction
- Predicting genes in genomic sequence with NO supporting expressed evidence
- Compositional Methods
- Scan for features in the sequence using consensus sequences (ab initio methods)
- Comparative Methods
- Compare sequence to cDNA sequence databases
- Compare sequence to EST sequence databases
- Therefore both methods have to be used
21 Ab initio gene prediction
- First ones predicted single exons, e.g. GRAIL (Uberbacher, 91) or MZEF (Zhang, 97)
- Later, predict entire genes, e.g. Genscan (Burge, 97) and Fgenesh (Solovyev, 95)
- Predict individual exons based on codon usage and sequence signals (start, stop, splice sites), followed by assembly of putative exons into genes
- Genscan predicts 90% of coding nucleotides and 70% of coding exons (Guigo, 00)
- Cannot use gene prediction methods alone to accurately identify every gene in a genome
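The codon-usage idea can be sketched as a log-odds score: codons that are frequent in known coding sequence score above zero, rare ones below. This is a toy illustration under stated assumptions, not the model used by GRAIL, Genscan or any other tool named above. The function names, the uniform background over 64 codons and the pseudocount are all assumptions, and the training sequence is invented.

```python
import math
from collections import Counter


def codon_counts(seqs):
    """Count in-frame codons across a set of known coding sequences."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        counts.update(s[i:i + 3] for i in range(0, len(s) - 2, 3))
    return counts


def coding_log_odds(window, coding_counts, pseudocount=1.0):
    """Sum of log2(P(codon | coding) / P(codon | uniform)) over the window.

    Assumes the window starts in frame; the background is uniform (1/64).
    """
    total = sum(coding_counts.values()) + 64 * pseudocount
    score = 0.0
    for i in range(0, len(window) - 2, 3):
        codon = window[i:i + 3].upper()
        p_coding = (coding_counts.get(codon, 0) + pseudocount) / total
        score += math.log2(p_coding * 64)
    return score


# Toy training set (invented): GCC is strongly over-represented.
counts = codon_counts(["ATGGCCGCCGCC"])
print(coding_log_odds("GCCGCC", counts))  # positive: resembles training codons
print(coding_log_odds("TTTTTT", counts))  # negative: codon unseen in training
```

Real predictors combine a score of this kind with signal models for start, stop and splice sites before assembling exons into genes.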
22 Comparative Methods
- Compare two genomes and identify conserved regions (see Genome analysis module)
- Combine with gene prediction
- E.g. Twinscan
TWINSCAN version 3.0 is now called N-SCAN
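The "identify conserved regions" step can be illustrated with a sliding-window percent identity over two sequences that are already aligned. This is a simplified sketch and not how Twinscan or N-SCAN work internally; the function name, window size and identity threshold are assumptions chosen for the example.

```python
def conserved_windows(aln_a, aln_b, window=10, min_identity=0.8):
    """Return (start, identity) for each window meeting the identity threshold.

    aln_a and aln_b are rows of a pairwise alignment (equal length, '-' for gaps).
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    hits = []
    for start in range(0, len(aln_a) - window + 1):
        a = aln_a[start:start + window]
        b = aln_b[start:start + window]
        # Gap columns never count as matches.
        matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
        identity = matches / window
        if identity >= min_identity:
            hits.append((start, identity))
    return hits


seq_a = "ACGTACGTAC" + "TTTTTTTTTT"
seq_b = "ACGTACGTAC" + "GGGGGGGGGG"
print(conserved_windows(seq_a, seq_b, window=10, min_identity=1.0))
```

Overlapping hits would normally be merged into intervals before being passed to a gene predictor.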
23 The complete Gene Catalogue
- How do we know when we have identified every gene in a genome?
- The data is fluid
- genome sequence is refined
- more transcripts are sequenced
- gene prediction methods are improved
- When looking at databases always keep a record of the VERSION you are using
- NCBI genome build, Ensembl gene build, nucleotide sequence version etc.
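One way to follow the advice on recording versions is to store a small provenance record next to every analysis result. The sketch below is illustrative only: the class and field names are hypothetical, and the values are placeholders rather than real build identifiers.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AnnotationProvenance:
    """Versions of the resources an analysis was run against."""
    genome_build: str      # e.g. the NCBI genome build
    gene_build: str        # e.g. the Ensembl gene build
    sequence_version: str  # accession.version of the nucleotide sequence
    date_accessed: str


def to_json(record):
    """Serialise the record so it can be saved alongside the results."""
    return json.dumps(asdict(record), sort_keys=True)


# Placeholder values, not real identifiers.
example = AnnotationProvenance(
    genome_build="EXAMPLE_GENOME_BUILD",
    gene_build="EXAMPLE_GENE_BUILD",
    sequence_version="NM_000000.1",
    date_accessed="YYYY-MM-DD",
)
print(to_json(example))
```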
24 Genome/Gene Access
- All databases are linked
- All use same/similar raw data
- Data handled in slightly different ways
- e.g. Ensembl gene build vs Gnomon gene prediction
25 Choose the best point of entry
- Literature: PubMed (NCBI)
- Disease name: OMIM (Human), MGI (Mouse)
- Gene info: EntrezGene (NCBI)
- Genetic interval: Ensembl
- Microarray probe list: Ensembl
- Comparative information: UCSC
- Self-generated sequence: BLAST/BLAT (Ensembl, UCSC, NCBI)
- Can link out to most other places
26 NCBI Entry points
27 Entrez entry point
28 EntrezGene
29 Ensembl entry points
30 Ensembl entry points
31 Ensembl Entry points
32 Real problem
- Disease of Interest
- ENU-induced mutation causes homeotic-like transformation of the skeleton
- Genetically mapped to D5Mit128-D5Mit107