Title: Design and Sources of Annotation
1Design and Sources of Annotation
2From prokaryotes to eukaryotes
Escherichia coli: 4300 genes
Saccharomyces cerevisiae: 6400 genes (about 250 with introns)
Neurospora crassa: 13000 genes estimated (more than 50 with introns)
Caenorhabditis elegans: 19000 genes
Drosophila melanogaster: 13000 genes
Arabidopsis thaliana: 25000 genes (few genes have no introns)
- (higher) Eukaryotes
- medium sized to large genomes
- complex gene structures
- split gene structure is a general feature
- genes can spread over several hundred kb
- (Arabidopsis 30 kb, 80 exons)
- trans-splicing, alternative splicing
- -> challenge to model individual exons and combine them to complete gene models
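A split gene model can be thought of as an ordered list of exon intervals, from which the introns and the spliced length follow. The sketch below is only an illustration of that idea; the coordinate convention (1-based, inclusive) and the function names are assumptions, not part of the original slides.

```python
from typing import List, Tuple

# Assumed convention: 1-based, inclusive (start, end) on the genomic sequence.
Exon = Tuple[int, int]


def introns_from_exons(exons: List[Exon]) -> List[Exon]:
    """Return the introns implied by the gaps between consecutive exons."""
    ordered = sorted(exons)
    introns = []
    for (_, left_end), (right_start, _) in zip(ordered, ordered[1:]):
        if right_start <= left_end:
            raise ValueError("exons overlap")
        if right_start - left_end > 1:
            introns.append((left_end + 1, right_start - 1))
    return introns


def spliced_length(exons: List[Exon]) -> int:
    """Total length of all exons, i.e. the length after splicing."""
    return sum(end - start + 1 for start, end in exons)


model = [(101, 250), (400, 520), (900, 1010)]  # made-up coordinates
print(introns_from_exons(model))  # [(251, 399), (521, 899)]
print(spliced_length(model))      # 382
```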
3Guided tour through analysis, gene modelling and annotation routine
4semiautomatic sequence analysis
sequencing lab
- vector contamination
- ambiguity codes
- overlap comparison
- coherence with restriction map
- verification (sequencing labs/independent lab)
- overlap inconsistencies
- putative frameshifts/errors in coding regions
DNA sequence
public ftp directory DNA database
web
- database searches
- BLASTN: EMBL, dbEST (organism)
- BLASTX: nonred. protein, trEST (organism)
- gene predictions
- GENEMARK
- XGRAIL
- NETPLANTGENE
- GENEFINDER
- GENSCAN
- FGENESH
- features
- repeats
- tRNAscan
parse and filter
gene models features
graphical display
5 1. Sequence analysis
- programs for gene prediction (Genefinder, Genscan, Genemark, Fgenesh)
- blastx comparison -> nonred. protein database (TREMBL, PIR, SWISSPROT)
- EST comparison -> EST database (organism-specific EMBL ESTs)
- blastn comparison -> EMBL (identical genes, rRNAs, contamination)
- tRNA prediction program (tRNAscan) -> tRNAs
- GCG repeat -> tandem repeats
- blastp comparison of preselected gene models, derived from the prediction program specifically trained for the organism -> nonred. protein database
- text file for annotation is written with preselected gene models
6Genemodeling is a major task in higher eukaryotes
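[Figure: gene structure drawn from 5' to 3'; the correct gene structure is compared with typical prediction errors]
- extended exon
- missing exon
- additional exon
- missing intron
- extended gene model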
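The error types named on this slide can be detected mechanically when a trusted exon set is available. The following is a simplified sketch, not the procedure used by the authors: the function names, the interval convention (inclusive coordinates) and the rule that separates an "additional exon" from an "extended gene model" (inside versus outside the reference span) are assumptions made for illustration. Shortened exons are deliberately not reported.

```python
def overlaps(a, b):
    """True if two inclusive (start, end) intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def compare_models(reference, predicted):
    """List (error type, exon) pairs for a prediction; reference must be non-empty."""
    ref, pred = sorted(reference), sorted(predicted)
    span = (ref[0][0], ref[-1][1])  # extent of the reference gene
    findings = []
    for r in ref:
        if not any(overlaps(r, p) for p in pred):
            findings.append(("missing exon", r))
    for p in pred:
        hits = [r for r in ref if overlaps(p, r)]
        if len(hits) >= 2:
            # One predicted exon covers two reference exons: the intron is lost.
            findings.append(("missing intron", p))
        elif len(hits) == 1:
            r = hits[0]
            if p != r and p[0] <= r[0] and p[1] >= r[1]:
                findings.append(("extended exon", p))
        elif overlaps(p, span):
            findings.append(("additional exon", p))
        else:
            findings.append(("extended gene model", p))
    return findings


reference = [(100, 200), (300, 400), (500, 600)]               # made-up coordinates
predicted = [(100, 220), (300, 400), (450, 470), (700, 750)]
for finding in compare_models(reference, predicted):
    print(finding)
```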
7 2. Gene modelling
- Graphical tool to visualize gene models (developed by Grigory Kolesov)
- all predicted gene models
- blastx matches
- EST matches
- DNA sequence
- three frame translation
- search tool
- headlines of highlighted matches
- codes of models
- marks exons in the sequence
- reloading gene model after correction
- the mouse pointer is connected to the sequence coordinate
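The three frame translation shown in the viewer can be reproduced with a few lines. This is a minimal sketch assuming the standard genetic code and the forward strand only; it is not the code of the viewer itself, and the function names are illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna: str, frame: int) -> str:
    """Translate one forward frame (0, 1 or 2); unknown codons become 'X'."""
    seq = dna.upper()
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def three_frame_translation(dna: str) -> dict:
    """Return the translations of frames 1 to 3 of the given strand."""
    return {frame + 1: translate(dna, frame) for frame in range(3)}


print(three_frame_translation("ATGGCCTAA"))
# {1: 'MA*', 2: 'WP', 3: 'GL'}
```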
8Visualisation: Viewer for gene modelling
ESTs
Blastx
Orfs
Gene models
final gene model
9Visualisation
individual blast matches are displayed
this model has to be divided into two genes
10individual EST matches are displayed
the ESTs also confirm the existence of two genes
11click on a gene model
- sequence and translation are shown
- aa and nt searches are possible
12problems in evaluation
no standard of truth
comparison of 3 different data sets against each
other (A.thaliana, three annotation groups)
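Without a standard of truth, the only thing that can be measured is how often the data sets agree with each other. A rough sketch of such a comparison follows; it assumes each gene model is stored as a tuple of exon coordinates so that identical structures compare equal, and the group names and coordinates are invented for the example.

```python
from itertools import combinations


def agreement(datasets):
    """Count identical gene models per pair of groups and list models shared by all."""
    pairwise = {}
    for (name_a, models_a), (name_b, models_b) in combinations(datasets.items(), 2):
        pairwise[(name_a, name_b)] = len(models_a & models_b)
    shared_by_all = set.intersection(*datasets.values())
    return pairwise, shared_by_all


gene1 = ((100, 200), (300, 400))
gene2 = ((1000, 1500),)
gene2_alt = ((950, 1500),)  # same gene with a 5' extension

datasets = {
    "group_A": {gene1, gene2},
    "group_B": {gene1, gene2_alt},
    "group_C": {gene1, gene2},
}
pairwise, shared = agreement(datasets)
print(pairwise)  # {('group_A', 'group_B'): 1, ('group_A', 'group_C'): 2, ('group_B', 'group_C'): 1}
print(shared)    # {((100, 200), (300, 400))}
```

Agreement between groups does not prove correctness; it only shows where the models are uncontroversial.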
13discrepancy examples: EXAMPLE I
5' exon extension
MSL1MGL6_36/ MYA6.14/ test_370
14EXAMPLE II
- separation of 3 genes by all 3 parties
- two groups decided to incorporate exon from
genefinder
MSL1MGL6_57/ MDC8.17/ test_580
15EXAMPLE III
right part consistent in all groups
receptor protein kinase, very heterogeneous blastx matches, very variable predictions at the 5' end
MSL1MGL6_6/ MSL1.6/ test_60
16New genomes, new challenges
Some examples from Rice
17Example from rice
Annotation genmark.hmm FGENESH Genscan Genefinder
C200 vesicle associated membrane protein (VAMP)
programs are not specifically trained; no program is able to predict the correct gene
18Example from rice
Annotation genmark.hmm FGENESH Genscan Genefinder
C90 WRKY DNA binding protein
two programs predict correct gene
19Example from rice
Genefinder Genscan FGENESH genmark.hmm
Annotation
Glycerol 3-phosphate Permease
two programs predict correct gene
20Example from rice
Genefinder Genscan FGENESH genmark.hmm
Annotation
W455 GTP binding protein
no correct prediction
21Examples for the reliability of trained programs
BUT trained programs are not always correct either; problems occur mostly in 5' and 3' exons
22EST matches confirm the correct prediction
EST matches are often too short to confirm the intron-exon structure
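Whether an EST confirms an intron can be checked by comparing the gaps between its aligned blocks with the introns of the gene model. The sketch below assumes that the EST alignment is already available as a list of genomic blocks with exact splice boundaries; real alignments may need tolerance at the borders, which is not handled here. All names and coordinates are illustrative.

```python
def gaps(blocks):
    """Return the set of gaps between consecutive inclusive (start, end) blocks."""
    ordered = sorted(blocks)
    return {(a_end + 1, b_start - 1)
            for (_, a_end), (b_start, _) in zip(ordered, ordered[1:])}


def confirmed_introns(model_exons, est_blocks):
    """Introns of the model whose borders are matched exactly by a gap in the EST."""
    return sorted(gaps(model_exons) & gaps(est_blocks))


model = [(100, 200), (300, 400), (500, 600)]
print(confirmed_introns(model, [(150, 200), (300, 360)]))  # [(201, 299)]
print(confirmed_introns(model, [(320, 380)]))              # [] : EST too short
```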
23EST matches, no obvious ORF
[Figure labels: 3' EST, annotated feature, 5' EST]
24example from modelling Neurospora crassa genes
no blast matches, no ESTs: we have to rely on the trained program
25Summary of gene prediction
- gene prediction in higher eukaryotes is a major task
- parallel use of a collection of programs increases reliability
- combination of extrinsic and intrinsic data increases quality
- gene prediction programs have to be trained on organism-specific training sets
- assessment of reliability has to be done BEFORE starting large scale genome annotation
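One simple way to use several programs in parallel is to keep the exons that at least a minimum number of programs predict identically. This voting scheme is an illustrative assumption, not the method described in the slides, and the coordinates are invented.

```python
from collections import Counter


def consensus_exons(predictions, min_votes=2):
    """Return exons predicted identically by at least min_votes programs."""
    votes = Counter(exon for exons in predictions.values() for exon in set(exons))
    return sorted(exon for exon, n in votes.items() if n >= min_votes)


predictions = {
    "Genscan": [(100, 200), (300, 400)],
    "Genefinder": [(100, 200), (310, 400)],
    "FGENESH": [(100, 200), (300, 400), (500, 560)],
}
print(consensus_exons(predictions))               # [(100, 200), (300, 400)]
print(consensus_exons(predictions, min_votes=3))  # [(100, 200)]
```

Exons supported by only one program are not necessarily wrong; they are the candidates that most need manual inspection against blastx and EST matches.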
26Annotation routine
gene model
annotated EMBL submission
database
web
- Extraction
- GENSCAN
- Genmark.hmm
- GENEFINDER
- Annotation
- classification
- functional category
- cDNA/EST matches
- prosite patterns
- literature
- ...
- nonred. protein database
- EST database
- Inspection
- Correction
- corrected models
web
27 2. Manual annotation
- predicted genes are corrected according to ESTs and to really good blast matches against "known" proteins (experimentally verified proteins); BUT ATTENTION: experimentally validated proteins can also be incorrect
- blastp comparison (corrected gene models are extracted and compared to our nonredundant protein database) to verify the correction and to annotate similarity
- known proteins are selected from the blastp matches: a time consuming task
- matching hypothetical yeast proteins from PIR or other entries have to be looked up in yeast databases (MIPS CYGD or SGD or YPD); a lot of them are known to exist and have validated functions
- fasta comparison for a better impression of similarity (similarity from start to stop?)
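The question "similarity from start to stop?" can be made concrete as the fraction of each sequence covered by the alignment. The sketch below assumes 1-based inclusive alignment coordinates; the 0.9 threshold is an arbitrary example value, not a cutoff taken from the slides.

```python
def alignment_coverage(q_start, q_end, q_len, s_start, s_end, s_len):
    """Return the covered fraction of the query and of the subject sequence."""
    q_cov = (q_end - q_start + 1) / q_len
    s_cov = (s_end - s_start + 1) / s_len
    return q_cov, s_cov


def spans_full_length(q_start, q_end, q_len, s_start, s_end, s_len, min_cov=0.9):
    """True if the alignment covers at least min_cov of both sequences."""
    return min(alignment_coverage(q_start, q_end, q_len, s_start, s_end, s_len)) >= min_cov


# Gene model of 300 aa aligned over its whole length to residues 5-310 of a 320 aa protein.
print(spans_full_length(1, 300, 300, 5, 310, 320))    # True
# Only a domain matches: residues 120-300 against 1-180 of a 500 aa protein.
print(spans_full_length(120, 300, 300, 1, 180, 500))  # False
```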
28Which of these proteins is a "known" protein?
Four yeast proteins have to be looked up to see whether they have been validated in the meantime.
calcium-related ... ?? Is it a validated protein?
31annotated features gene/protein level
feature: program/tool
- gene structure: blastx (PIR), blastn (ESTs), gene prediction algorithms
- gene code
- protein title: closest homolog / probable / related to / hypothetical
- classification: MIPS classification catalogue
- functional classification (if possible): MIPS functional catalogue
- similarity: FASTA comparison (Score)
- EST matches: blastn (ESTs from EMBL)
- motifs (only for class 1 proteins): motifs (GCG)
- EC number: PIR
- cross reference: PIR, Swissprot
- literature: MIPS literature database
32MIPS classification catalogue
33CLASS 2 strong similarity to known protein
34CLASS 3 similarity to known protein
35CLASS 4 similarity or strong similarity to
unknown protein
CLASS 5 similarity to EST
CLASS 6 no similarity
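The class assignment for classes 2 to 6 can be written as a small decision function. This is a sketch under stated assumptions: the slides give no score thresholds, so `weak_cutoff` and `strong_cutoff` are hypothetical parameters that the caller must choose, and class 1 is left out because it is not described on these slides.

```python
def mips_class(best_score, hit_is_known, has_est_match, weak_cutoff, strong_cutoff):
    """Assign classes 2-6 from the best protein hit and EST evidence.

    best_score: similarity score of the best protein match, or None if there is none.
    hit_is_known: True if the best match is a protein of known function.
    weak_cutoff / strong_cutoff: hypothetical thresholds for "similarity"
    and "strong similarity".
    """
    if best_score is not None and best_score >= weak_cutoff:
        if not hit_is_known:
            return 4  # similarity or strong similarity to unknown protein
        return 2 if best_score >= strong_cutoff else 3
    return 5 if has_est_match else 6


cutoffs = {"weak_cutoff": 100, "strong_cutoff": 300}  # example values only
print(mips_class(350, True, False, **cutoffs))   # 2
print(mips_class(150, True, False, **cutoffs))   # 3
print(mips_class(350, False, True, **cutoffs))   # 4
print(mips_class(None, False, True, **cutoffs))  # 5
print(mips_class(40, False, False, **cutoffs))   # 6
```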
36MIPS functional catalogue
- METABOLISM
- ENERGY
- CELL GROWTH, CELL DIVISION AND DNA SYNTHESIS
- TRANSCRIPTION
- PROTEIN SYNTHESIS
- PROTEIN DESTINATION
- TRANSPORT FACILITATION
- INTRACELLULAR TRANSPORT
- CELLULAR BIOGENESIS
- SIGNAL TRANSDUCTION
- CELL RESCUE, DEFENSE, CELL DEATH AND AGING
- IONIC HOMEOSTASIS
- CELLULAR ORGANIZATION
- RETROTRANSPOSONS AND PLASMID PROTEINS
- CLASSIFICATION NOT YET CLEAR CUT
- UNCLASSIFIED PROTEINS
37text file for annotation
38 4. Automatic annotation
- submission to PEDANT (Protein Extraction, Description, and Analysis Tool) for calculation
- developed by Dmitrij Frishman
- additional annotation of potential protein function and structure
39Databases for finding functional assignment
- PIR-International (PIR keywords, PIR families and superfamilies)
- SWISSPROT
- PROSITE: protein sites and patterns
- InterPro (PROSITE, Pfam, ProDom, PRINTS)
- BLOCKS: aligned ungapped segments of highly conserved regions
- Pfam: collection of common protein domains
- COGS: Clusters of Orthologous Groups
- PubMed
- Enzyme databases: Expasy (Swissprot) Enzyme nomenclature database, with crosslinks to WIT (Metabolic reconstruction) (Pathway) and to Kyoto University Ligand Chemical Database (Pathway)
Databases for finding structural assignment
- PDB (experimentally determined three-dimensional structures)
- SCOP: Structural Classification of Proteins
43 PIR fam
Pfam dom
BLOCKS BLAST
45MATDB: MIPS Arabidopsis thaliana database
46CYGD: Comprehensive Yeast Genome Database
YPD (Yeast Proteome Database)
SGD (Saccharomyces Genome Database)
49CYGD - annotation
50Protein/Protein interactions and functional
networks
M. Fellenberg, MIPS
51Mapping interactions to functions
mRNA processing cluster. Thin boxes: different functions; light grey: transcription; dark grey: splicing; grey areas: processing complex
M. Fellenberg et al., MIPS
52More information on yet unknown functions
Orphan YNR053c interacts with 5 proteins of known function. These interact with many proteins of unknown function.
M. Fellenberg et al., MIPS
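The idea of inferring a function for an orphan from its interaction partners can be sketched as a simple tally of the partners' annotated functions. This is a generic "guilt by association" illustration, not the method of Fellenberg et al.; the protein names, interactions and functions below are invented.

```python
from collections import Counter


def neighbour_functions(protein, interactions, functions):
    """Rank the known functions of the direct interaction partners of a protein."""
    partners = set()
    for a, b in interactions:
        if a == protein:
            partners.add(b)
        elif b == protein:
            partners.add(a)
    counts = Counter(f for p in partners for f in functions.get(p, ()))
    return counts.most_common()


interactions = [("ORF_X", "P1"), ("ORF_X", "P2"), ("P3", "ORF_X"), ("P1", "P4")]
functions = {"P1": ["splicing"], "P2": ["splicing"], "P3": ["transcription"]}
print(neighbour_functions("ORF_X", interactions, functions))
# [('splicing', 2), ('transcription', 1)]
```

The ranking is only a hint for further experiments, since interaction data from large scale screens contain false positives.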
53Functional Analysis Data
- data sets from large scale experiments
- data from expression analysis (microarrays)
- data from proteomics
- data from ??
- how can they be integrated into annotated genomes to be used best by scientists
55 Expression profiles
56THM
- Sequencing complete genomes is now a fast process
- Automated gene prediction with specifically trained prediction programs is possible, but error-prone
- Manual supervision and correction is necessary
- Basic manual annotation in genome projects ensures quality, if certain quality standards are observed
- Manual annotation is time and cost consuming, but there is a payoff!!!
- Extensive annotation increases the value of a genome database remarkably (A. thaliana database has about 10-20 000 accesses per day, the yeast database about 10 000)
- Large experimental data sets have to be integrated into the annotation process and added even after finishing a genome project
- Integration of as many experimental results as possible is very important and will in the future depend on the collaboration of the scientific community.