Gene Prediction/Identification

Description:

draft. WGS. The hybrid approach. August 2006. The Genome Access Course. Genome Assembly ... Horse. Cat. Mink. Bear. G. Panda. Gibbon. Macaque. Lemur. Tree ...

Provided by: gareth76

1
Gene Prediction/Identification
2
Prokaryotic Genomes
  • Small genomes, high gene density
  • Haemophilus influenzae genome is 85% genic
  • First cellular genome sequenced (1995, Venter et al.), 1.8 Mb
  • Operons: one transcript, many genes
  • No introns: one gene, one protein
  • Open reading frames (ORFs): one ORF per gene
  • ORFs begin with a start codon and end with a stop codon
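
The ORF definition on this slide (start codon, then an in-frame stop codon) can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not a prokaryotic gene finder: it treats ATG as the only start codon, uses the standard stop codons TAA, TAG and TGA, and scans the three forward frames only. The names `find_orfs` and `min_codons` are made up for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"  # assumption: alternative starts such as GTG/TTG are ignored


def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) for each forward-strand ORF.

    start is the 0-based index of the A in ATG, end is the index just
    past the stop codon, and min_codons counts the stop codon as well.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon in this reading frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    print(find_orfs("CCATGAAATAGGG"))  # [(2, 11, 2)]
```

A real annotation pipeline would also scan the reverse strand and apply a much larger minimum length before calling an ORF a candidate gene.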

3
Identification of Functional Units
  • Genome sequencing
  • Transcript sequencing
  • Genes: coding and non-coding
  • Others: promoters, transcription factor binding sites,
    replication origins
  • ENCODE project: http://www.genome.gov/10005107

4
Sequencing large genomes
The hybrid approach
5
Genome Assembly
Chromosome project
6
Genomes available
  • Vertebrates

Data reproduced from Genome Maps 10, Science 1999.
Human, Chimpanzee, Fish, Mouse, Rat, Bull, Sheep, Muntjac,
Dolphin, Pig, Horse, Cat, Mink, Bear, Giant Panda, Gibbon,
Macaque, Tree Shrew, Bat, European Shrew, Lemur
  • Invertebrates
  • Fly, Worm, Yeast, Bacteria
  • Plants

7
Gene identification
  • When is a gene a gene?
  • All genes need some kind of supporting evidence
  • Does the region annotated as a gene have
    evidence of actually being transcribed?

8
The evidence for a gene
mRNA -> reverse transcription -> cDNA
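
At the sequence level, reverse transcription produces a first-strand cDNA that is the reverse complement of the mRNA, written in DNA letters. A minimal sketch, assuming the mRNA is given 5'->3' and contains only A, C, G and U (the function names are invented for this example):

```python
# Map each RNA base to its complementary DNA base.
RNA_TO_DNA_COMPLEMENT = str.maketrans("ACGU", "TGCA")


def first_strand_cdna(mrna):
    """Return the first-strand cDNA (5'->3') for an mRNA given 5'->3'."""
    return mrna.upper().translate(RNA_TO_DNA_COMPLEMENT)[::-1]


def second_strand_cdna(mrna):
    """Return the second strand, which matches the mRNA with U replaced by T."""
    return mrna.upper().replace("U", "T")


if __name__ == "__main__":
    print(first_strand_cdna("AUGGCC"))   # GGCCAT
    print(second_strand_cdna("AUGGCC"))  # ATGGCC
```

The second strand is the form usually deposited in the sequence databases, which is why a cDNA entry reads like the transcript itself.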
9
Sequence submission
  • Information is mirrored daily between DDBJ
    (Japan), GenBank (USA) and EMBL (Europe).

10
Information is mirrored daily between databases
Most comprehensive source of DNA sequences
11
Summary of Available Databases
DNA databases (EMBL/GenBank/DDBJ)
mRNA (cDNA)
Genomic (finished, draft)
dbEST (ESTs)
12
Information Hubs: entry points for interrogation
  • Ensembl, UCSC, NCBI

13
Gene Catalogues
  • Ensembl - gene build

14
Ensembl Gene catalogue
  • Gene view exon info

Supporting evidence (exon)
Protein alignment
DNA alignment
15
Gene Catalogues
  • NCBI - RefSeq, Gnomon

16
Reference Sequence
Goal: one sequence entry for each naturally
occurring DNA, RNA and protein molecule

Molecule            Accession format
chromosome          NC_000000
gene                NG_000000
contig              NT_000000
mRNA                NM_000000
protein             NP_000000
predicted mRNA      XM_000000
predicted protein   XP_000000

Key: curated, calculated
Multiple products for one gene are instantiated
as separate RefSeqs with the same GeneID.
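
The accession prefixes on this slide lend themselves to a simple lookup. The sketch below covers only the seven prefixes shown here and uses the slide's own labels; `classify_refseq` is a name invented for this example, and RefSeq defines further prefixes that are not handled.

```python
# Prefix -> molecule type, as listed on the slide (not the full RefSeq set).
REFSEQ_PREFIXES = {
    "NC": "chromosome",
    "NG": "gene",
    "NT": "contig",
    "NM": "mRNA",
    "NP": "protein",
    "XM": "predicted mRNA",
    "XP": "predicted protein",
}


def classify_refseq(accession):
    """Return the molecule type for a RefSeq-style accession, or None."""
    prefix, separator, _ = accession.partition("_")
    if not separator:
        return None  # no underscore, so not in PREFIX_number form
    return REFSEQ_PREFIXES.get(prefix.upper())


if __name__ == "__main__":
    for acc in ("NM_000000", "XP_000000", "ABC123"):
        print(acc, "->", classify_refseq(acc))
```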
17
Gene Catalogues
  • UCSC

18
General Gene classification
  • Known genes: as catalogued by the reference
    sequence project
  • Ensembl known genes (red genes)
  • NCBI known genes
  • Novel genes (1): based on similarity to known
    genes or cDNAs
  • These need not have 100% matching supporting
    evidence
  • Ensembl novel genes (black)
  • NCBI LOC genes (locus)
19
General Gene classification
  • Novel genes (2): based on the presence of ESTs
  • Resource of alternative splicing
  • EST genes in Ensembl (purple)
  • Database of transcribed sequences (DoTS),
    www.allgenes.org
  • Acembly (assembly)
  • Gene prediction
  • Single organism: Genscan
  • Comparative information: Twinscan
    (commonalities and differences)
  • Pseudogenes: match a known gene but with a
    disrupted ORF
  • Gene prediction with NO prior expressed
    sequence as evidence
20
Gene prediction
  • Predicting genes in genomic sequence with NO
    supporting expressed evidence
  • Compositional Methods
  • Scan for features in sequence using consensus
    sequence
  • ab initio methods
  • Comparative Methods
  • Compare sequence to cDNA sequence databases
  • Compare sequence to EST sequence databases
  • Conclusion: have to use both methods
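
Scanning for a feature with a consensus sequence, as listed under the compositional methods above, can be sketched by turning IUPAC ambiguity codes into a regular expression. This is an illustration rather than the method of any particular tool; `scan_consensus` is a made-up name, and the `GTRAGT` pattern in the usage line is just an example consensus chosen for the demonstration.

```python
import re

# IUPAC nucleotide codes expressed as regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def scan_consensus(seq, consensus):
    """Return the 0-based start of every match, including overlapping ones."""
    pattern = "".join(IUPAC[base] for base in consensus.upper())
    # A lookahead consumes no characters, so overlapping hits are reported.
    return [m.start() for m in re.finditer(f"(?={pattern})", seq.upper())]


if __name__ == "__main__":
    print(scan_consensus("AAGGTAAGTCCGTGAGT", "GTRAGT"))  # [3, 11]
```

Short consensus patterns match by chance very often in genomic sequence, which is one reason a consensus scan alone is not enough to call a gene.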

21
Ab initio gene prediction
  • First ones predicted single exons, e.g. GRAIL
    (Uberbacher, 91) or MZEF (Zhang, 97)
  • Later, predict entire genes e.g. Genscan (Burge
    97) and Fgenesh (Solovyev, 95)
  • Predict individual exons based on codon usage and
    sequence signals (start, stop, splice sites)
    followed by assembly of putative exons into genes
  • Genscan predicts 90% of coding nucleotides, and
    70% of coding exons (Guigo, 00)
  • Cannot use gene prediction methods alone to
    accurately identify every gene in a genome
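
Figures such as "percent of coding nucleotides" and "percent of coding exons" come from comparing predictions against an annotated reference. The sketch below shows one common way to compute them, assuming exons are half-open `(start, end)` intervals on a single strand. The function names are invented, and "specificity" follows the usage common in gene-prediction evaluation (the fraction of predicted coding bases that are truly coding).

```python
def covered_positions(intervals):
    """Expand half-open (start, end) intervals into a set of positions."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end))
    return positions


def nucleotide_accuracy(annotated, predicted):
    """Return (sensitivity, specificity) at the nucleotide level."""
    truth = covered_positions(annotated)
    pred = covered_positions(predicted)
    true_positives = len(truth & pred)
    sensitivity = true_positives / len(truth) if truth else 0.0
    specificity = true_positives / len(pred) if pred else 0.0
    return sensitivity, specificity


def exon_sensitivity(annotated, predicted):
    """Fraction of annotated exons predicted with both boundaries exact."""
    annotated_set = set(annotated)
    if not annotated_set:
        return 0.0
    return len(annotated_set & set(predicted)) / len(annotated_set)


if __name__ == "__main__":
    annotated = [(0, 100), (200, 300)]
    predicted = [(0, 90), (200, 300), (400, 410)]
    print(nucleotide_accuracy(annotated, predicted))
    print(exon_sensitivity(annotated, predicted))
```

In this toy case most coding bases are recovered but only one of the two exons has exact boundaries, which mirrors the gap between the nucleotide-level and exon-level figures quoted above.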

22
Comparative Methods
  • Compare two genomes and identify conserved
    regions (see Genome analysis module)
  • Combine with gene prediction
  • E.g. Twinscan

TWINSCAN version 3.0 is now called N-SCAN
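
Identifying conserved regions between two genomes can be illustrated with a sliding-window percent-identity scan over a pairwise alignment. This is a deliberately simple sketch and not how Twinscan or N-SCAN model conservation; it assumes the two sequences are already aligned to equal length with `-` for gaps, and `conserved_windows`, `window` and `min_identity` are names chosen for this example.

```python
def conserved_windows(aln_a, aln_b, window=10, min_identity=0.8):
    """Return (start, identity) for each alignment window that passes the cutoff."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    # A column counts as a match only if both bases agree and neither is a gap.
    matches = [a == b and a != "-" for a, b in zip(aln_a.upper(), aln_b.upper())]
    hits = []
    for start in range(len(matches) - window + 1):
        identity = sum(matches[start:start + window]) / window
        if identity >= min_identity:
            hits.append((start, identity))
    return hits


if __name__ == "__main__":
    a = "ACGTACGTAC" + "TTTTTTTTTT"
    b = "ACGTACGTAC" + "GGGGGGGGGG"
    for start, identity in conserved_windows(a, b):
        print(start, identity)
```

Windows that pass the cutoff could then be intersected with ab initio exon predictions, which is the general idea behind combining comparison with gene prediction.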
23
The complete Gene Catalogue
  • How do we know when we have identified every gene
    in a genome?
  • The data is fluid
  • genome sequence is refined
  • more transcripts are sequenced
  • gene prediction methods are improved
  • When looking at databases always keep a record of
    the VERSION you are using
  • NCBI genome build, Ensembl gene build, nucleotide
    sequence version etc
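
One lightweight way to keep a record of the versions used is to store them next to the analysis results. The sketch below is an assumption about how that might look, not a prescribed format: the `DataProvenance` class and its field names are invented, and the values in the usage example are placeholders rather than real releases.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date


@dataclass
class DataProvenance:
    """Versions of the resources an analysis was run against."""
    genome_build: str      # e.g. the NCBI genome build identifier
    gene_build: str        # e.g. the Ensembl gene build or release
    sequence_version: str  # accession.version of a nucleotide record
    accessed: str          # ISO date on which the data were retrieved


def provenance_json(record):
    """Serialise the record so it can be saved alongside the results."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


if __name__ == "__main__":
    record = DataProvenance(
        genome_build="PLACEHOLDER-genome-build",   # hypothetical value
        gene_build="PLACEHOLDER-gene-build",       # hypothetical value
        sequence_version="NM_000000.1",            # placeholder accession.version
        accessed=date.today().isoformat(),
    )
    print(provenance_json(record))
```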

24
Genome/Gene Access
  • All databases are linked
  • All use same/similar raw data
  • Data handled in slightly different ways
  • e.g. Ensembl gene build vs Gnomon gene prediction

25
Choose the best point of entry
  • Literature: PubMed (NCBI)
  • Disease name: OMIM (Human), MGI (Mouse)
  • Gene info: EntrezGene (NCBI)
  • Genetic interval: Ensembl
  • Microarray probe list: Ensembl
  • Comparative information: UCSC
  • Self-generated sequence: BLAST/BLAT (Ensembl, UCSC,
    NCBI)
  • Can link out to most other places

26
NCBI Entry points
27
Entrez entry point
28
EntrezGene
29
Ensembl entry points
  • Home page

30
Ensembl entry points
  • Genetic interval

31
Ensembl entry points
  • BioMart

32
Real problem
  • Disease of Interest
  • ENU-induced mutation causes homeotic-like
    transformation to skeleton
  • Genetically mapped to D5Mit128-D5Mit107