Design and Sources of Annotation - PowerPoint PPT Presentation

Slides: 57
Provided by: mt749
Transcript and Presenter's Notes

1
Design and Sources of Annotation
  • http://mips.gsf.de

2
From prokaryotes to eukaryotes
Escherichia coli: 4300 genes
Saccharomyces cerevisiae: 6400 genes (about 250 with introns)
  • (higher) Eukaryotes
  • medium sized to large genomes
  • complex gene structures
  • split gene structure is a general feature
  • genes can spread over several hundred kb
  • (Arabidopsis 30kb, 80 exons)
  • trans-splicing, alternative splicing
  • -> challenge: to model individual exons and combine
    them into complete gene models

Neurospora crassa: 13000 genes estimated (more than 50 with introns)
Caenorhabditis elegans: 19000 genes
Drosophila melanogaster: 13000 genes
Arabidopsis thaliana: 25000 genes (few genes have no introns)
3
Guided tour through the analysis, gene modelling and
annotation routine
  • sequence analysis
  • gene modelling
  • manual annotation
  • automated annotation

4
semiautomatic sequence analysis
sequencing lab
  • vector contamination
  • ambiguity codes
  • overlap comparison
  • coherence with restriction map
  • verification (sequencing labs/independent lab)
  • overlap inconsistencies
  • putative frameshifts/errors in coding regions
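
One of the checks above, counting ambiguity codes in a delivered sequence, can be sketched in a few lines. This is only an illustration, not the MIPS pipeline: the function name `ambiguity_report` and the returned fields are assumptions, and only the IUPAC code letters themselves are standard.

```python
from collections import Counter

# IUPAC nucleotide ambiguity codes (everything except the plain bases A, C, G, T).
AMBIGUITY_CODES = set("RYSWKMBDHVN")

def ambiguity_report(seq: str) -> dict:
    """Count IUPAC ambiguity codes in a DNA sequence (case-insensitive)."""
    counts = Counter(base for base in seq.upper() if base in AMBIGUITY_CODES)
    total = sum(counts.values())
    return {
        "length": len(seq),
        "ambiguous": total,
        "fraction": total / len(seq) if seq else 0.0,
        "by_code": dict(counts),
    }

# Hypothetical input, invented for illustration.
report = ambiguity_report("ACGTNNRYacgtn")
```

A lab could reject or flag sequences whose `fraction` exceeds an agreed limit; the limit itself is not given in the slides.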

DNA sequence
public ftp directory DNA database
web
  • database searches
      • BLASTN: EMBL, dbEST (organism)
      • BLASTX: nonred. protein, trEST (organism)
  • gene predictions
      • GENEMARK
      • XGRAIL
      • NETPLANTGENE
      • GENEFINDER
      • GENSCAN
      • FGENESH
  • features
      • repeats
      • tRNAscan

parse and filter
gene models features
graphical display
5
1. Sequence analysis
  • programs for gene prediction (Genefinder,
    Genscan, GeneMark, Fgenesh)
  • blastx comparison -> nonredundant protein database
    (TREMBL, PIR, SWISSPROT)
  • EST comparison -> EST database (organism-specific
    EMBL ESTs)
  • blastn comparison -> EMBL (identical genes,
    rRNAs, contamination)
  • tRNA prediction program (tRNAscan) -> tRNAs
  • GCG repeat -> tandem repeats
  • blastp comparison of preselected gene models,
    derived from the prediction program specifically
    trained for the organism,
    -> nonredundant protein database
  • text file for annotation is written with
    preselected gene models

6
Gene modelling is a major task in higher eukaryotes
(Figure: gene models drawn 5' to 3')
correct gene structure
  • extended exon
  • missing exon
  • additional exon
  • missing intron
  • extended gene model
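
The error types on this slide can be illustrated by comparing a predicted model with a reference model, exon by exon. The sketch below is an assumption-laden illustration, not the tool used by the authors: exons are (start, end) tuples on one strand, the reference is non-empty, an unmatched exon outside the reference span is read as an "extended gene model", and the label "other boundary difference" is added here for cases the slide does not name.

```python
def overlaps(a, b):
    """True if two (start, end) intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]

def compare_models(reference, predicted):
    """Label differences between a predicted and a reference gene model."""
    ref_start = min(start for start, _ in reference)
    ref_end = max(end for _, end in reference)
    issues = []
    # Reference exons that no predicted exon touches.
    for ref in reference:
        if not any(overlaps(ref, p) for p in predicted):
            issues.append(("missing exon", ref))
    for pred in predicted:
        hits = [r for r in reference if overlaps(pred, r)]
        if not hits:
            # Assumption: outside the reference span means the model was extended.
            outside = pred[1] < ref_start or pred[0] > ref_end
            issues.append(("extended gene model" if outside else "additional exon", pred))
        elif len(hits) > 1:
            # One predicted exon covering several reference exons swallowed an intron.
            issues.append(("missing intron", pred))
        elif pred != hits[0]:
            extends = pred[0] <= hits[0][0] and pred[1] >= hits[0][1]
            issues.append(("extended exon" if extends else "other boundary difference", pred))
    return issues

# Hypothetical coordinates, invented for illustration.
reference = [(100, 200), (300, 400), (500, 600)]
issues_a = compare_models(reference, [(50, 80), (100, 220), (300, 600)])
issues_b = compare_models(reference, [(100, 200), (500, 600), (250, 260)])
```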
7
2. Gene modelling
  • Graphical tool to visualize gene models
    (developed by Grigory Kolesov)
  • all predicted gene models
  • blastx matches
  • EST matches
  • DNA sequence
  • three frame translation
  • search tool
  • headlines of highlighted matches
  • codes of models
  • marks exons in the sequence
  • reloading gene model after correction
  • the mouse pointer is connected to the sequence
    coordinate
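
The three-frame translation shown by the viewer can be sketched as follows. This is not the viewer's code: it assumes the standard genetic code, forward strand only, and writes unknown codons (for example those containing N) as "X".

```python
# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(seq: str, frame: int = 0) -> str:
    """Translate one forward frame (0, 1 or 2); '*' marks a stop codon."""
    seq = seq.upper()
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)

def three_frame_translation(seq: str) -> list:
    """Return the translations of the three forward reading frames."""
    return [translate(seq, frame) for frame in range(3)]

frames = three_frame_translation("ATGGCCTAA")
```

Translating the reverse strand as well would need a reverse-complement step first, which is left out here.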

8
Visualisation: viewer for gene modelling
ESTs
Blastx
Orfs
Gene models
final gene model
9
Visualisation
individual blast matches are displayed
this model has to be divided into two genes
10
individual EST matches are displayed
the ESTs also confirm the existence of two genes
11
click on a gene model
  • sequence and
  • translation is
  • shown
  • aa and nt
  • searches are
  • possible

12
problems in evaluation
no standard of truth
comparison of 3 different data sets against each
other (A. thaliana, three annotation groups)
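
Without a standard of truth, the data sets can only be scored against each other. A minimal sketch of such a pairwise comparison follows; the measure used here (share of exons with identical coordinates, i.e. Jaccard index) and the group names are assumptions, since the slides do not say how agreement was measured.

```python
from itertools import combinations

def exon_agreement(a, b) -> float:
    """Jaccard index of two exon sets; exons must match in both coordinates."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def pairwise_agreement(annotations: dict) -> dict:
    """Agreement for every pair of annotation groups, keyed by (name, name)."""
    return {
        (x, y): exon_agreement(annotations[x], annotations[y])
        for x, y in combinations(sorted(annotations), 2)
    }

# Hypothetical exon coordinates for one gene, as annotated by three groups.
annotations = {
    "group_1": [(1, 10), (20, 30)],
    "group_2": [(1, 10), (20, 30)],
    "group_3": [(1, 10), (22, 30)],
}
agreement = pairwise_agreement(annotations)
```

A low score for one pair only says that the groups disagree, not which of them is right.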
13
discrepancy examples: EXAMPLE I
5' exon extension
MSL1MGL6_36/ MYA6.14/ test_370
14
EXAMPLE II
  • separation of 3 genes by all 3 parties
  • two groups decided to incorporate exon from
    genefinder

MSL1MGL6_57/ MDC8.17/ test_580
15
EXAMPLE III
right part consistent in all groups
receptor protein kinase, very heterogeneous blastx
matches, very variable predictions at the 5' end
MSL1MGL6_6/ MSL1.6/ test_60
16
New genomes, new challenges
Some examples from Rice
17
Example from rice
Annotation genmark.hmm FGENESH Genscan Genefinder
C200 vesicle associated membrane protein (VAMP)
programs are not specifically trained; no program
is able to predict the correct gene
18
Example from rice
Annotation genmark.hmm FGENESH Genscan Genefinder
C90 WRKY DNA binding protein
two programs predict correct gene
19
Example from rice
Genefinder Genscan FGENESH genmark.hmm
Annotation
Glycerol 3-phosphate Permease
two programs predict correct gene
20
Example from rice
Genefinder Genscan FGENESH genmark.hmm
Annotation
W455 GTP binding protein
no correct prediction
21
Examples of the reliability of trained programs
BUT trained programs are not always correct
either; problems occur mostly in 5' and 3' exons
22
EST matches confirm the correct prediction
EST matches are often too short to
confirm the intron-exon structure
23
EST matches, no obvious ORF
3' EST

Annotated feature
5' EST
24
Example from modelling Neurospora crassa genes

no blast matches, no ESTs: we have to rely on
the trained program
25
Summary of gene prediction
  • gene prediction in higher eukaryotes is a major
    task
  • parallel usage of a collection of programs
    increases reliability
  • combination of extrinsic and intrinsic data
    increases quality
  • gene prediction programs have to be trained on
    organism-specific training sets
  • assessment of reliability has to be done BEFORE
    starting large-scale genome annotation
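
One simple way to use several programs in parallel is to keep only the exons that more than one program predicts. The sketch below uses a plain vote on identical coordinates. That is a stand-in chosen for illustration, not the procedure described in the slides, where the models were inspected and corrected manually. The coordinates are invented.

```python
from collections import Counter

def consensus_exons(predictions: dict, min_votes: int = 2) -> list:
    """Exons (start, end) predicted identically by at least min_votes programs."""
    votes = Counter(
        exon
        for exons in predictions.values()
        for exon in set(exons)  # each program votes once per exon
    )
    return sorted(exon for exon, n in votes.items() if n >= min_votes)

# Hypothetical predictions for one locus.
predictions = {
    "Genscan": [(100, 200), (300, 400)],
    "Genefinder": [(100, 200), (310, 400)],
    "FGENESH": [(100, 200), (300, 400), (500, 550)],
}
consensus = consensus_exons(predictions)
unanimous = consensus_exons(predictions, min_votes=3)
```

Exons that fail the vote are not necessarily wrong; they are the places where extrinsic evidence (blastx and EST matches) is most worth checking.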

26
Annotation routine
gene model
annotated EMBL submission
database
web
  • Extraction
  • GENSCAN
  • Genmark.hmm
  • GENEFINDER
  • Annotation
  • classification
  • functional category
  • cDNA/EST matches
  • prosite patterns
  • literature
  • ...
  • nonred. protein
  • database
  • ESTdatabase
  • Inspection
  • Correction
  • corrected models

web
27
3. Manual annotation
  • predicted genes are corrected according to ESTs and
    to really good blast matches against "known" proteins
    (experimentally verified proteins); BUT ATTENTION:
    experimentally validated proteins can also be incorrect
  • blastp comparison (corrected gene models are
    extracted and compared to our nonredundant protein
    database)
      • to verify the correction
      • to annotate similarity
  • known proteins are selected from the blastp
    matches
  • time-consuming task
  • matching hypothetical yeast proteins from PIR or
    other entries have to be looked up in yeast
    databases (MIPS CYGD or SGD or YPD)
  • a lot of them are known to exist and have
    validated functions
  • fasta comparison for a better impression of
    similarity (similarity from start to stop?)
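
The "similarity from start to stop?" question can be made concrete as alignment coverage of both proteins. The sketch below assumes 1-based inclusive alignment coordinates as reported by most alignment programs; the 0.9 cut-off and the function names are illustrative assumptions, not values from the slides.

```python
def alignment_coverage(aln_start: int, aln_end: int, seq_length: int) -> float:
    """Fraction of a protein covered by an alignment (1-based, inclusive)."""
    if seq_length <= 0:
        raise ValueError("sequence length must be positive")
    if not 1 <= aln_start <= aln_end <= seq_length:
        raise ValueError("alignment coordinates outside the sequence")
    return (aln_end - aln_start + 1) / seq_length

def is_start_to_stop(query_cov: float, subject_cov: float, threshold: float = 0.9) -> bool:
    """True if the alignment covers most of BOTH proteins (threshold is an assumption)."""
    return query_cov >= threshold and subject_cov >= threshold

# Hypothetical match: residues 1-450 of a 500 aa query align to 101-300 of a 400 aa hit.
query_cov = alignment_coverage(1, 450, 500)
subject_cov = alignment_coverage(101, 300, 400)
full_length = is_start_to_stop(query_cov, subject_cov)
```

A match that covers the query but only part of the database protein, as in this example, may point to a shared domain rather than to the same protein.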

28
Which of these proteins is a "known" protein?
Four yeast proteins have to be looked up to see
whether they have been validated in the meantime
calcium-related ... ?? Is it a validated protein?
31
annotated features gene/protein level
feature: program/tool
  • gene structure: blastx (PIR), blastn (ESTs), gene
    prediction algorithms
  • gene code
  • protein title: closest homolog / probable /
    related to / hypothetical
  • classification: MIPS classification catalogue
  • functional classification (if possible): MIPS
    functional catalogue
  • similarity: FASTA comparison (Score)
  • EST matches: blastn (ESTs from EMBL)
  • motifs (only for class 1 proteins): motifs (GCG)
  • EC number: PIR
  • cross reference: PIR, Swissprot
  • literature ...: MIPS literature database

32
MIPS classification catalogue
33
CLASS 2 strong similarity to known protein
34
CLASS 3 similarity to known protein
35
CLASS 4 similarity or strong similarity to
unknown protein
CLASS 5 similarity to EST
CLASS 6 no similarity
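
The class definitions above can be written as a small decision function. This is only a sketch: class 1 is not described in this transcript and is left out, the order in which evidence is checked (protein similarity before EST similarity) is an assumption, and so is the way similarity is passed in as a ready-made label rather than derived from a score.

```python
from typing import Optional

# Labels copied from the class slides; class 1 is not defined in the transcript.
CLASS_LABELS = {
    2: "strong similarity to known protein",
    3: "similarity to known protein",
    4: "similarity or strong similarity to unknown protein",
    5: "similarity to EST",
    6: "no similarity",
}

def assign_class(similarity: Optional[str],
                 hit_is_known: bool = False,
                 has_est_match: bool = False) -> int:
    """Return classification class 2-6 for a gene model.

    similarity: None, "similarity" or "strong similarity" for the best
    protein match (how that label is derived is outside this sketch).
    """
    if similarity not in (None, "similarity", "strong similarity"):
        raise ValueError(f"unexpected similarity label: {similarity!r}")
    if similarity is not None:
        if hit_is_known:
            return 2 if similarity == "strong similarity" else 3
        return 4
    return 5 if has_est_match else 6

example_class = assign_class("strong similarity", hit_is_known=True)
```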
36
MIPS functional catalogue
  • METABOLISM
  • ENERGY
  • CELL GROWTH, CELL DIVISION AND DNA SYNTHESIS
  • TRANSCRIPTION
  • PROTEIN SYNTHESIS
  • PROTEIN DESTINATION
  • TRANSPORT FACILITATION
  • INTRACELLULAR TRANSPORT
  • CELLULAR BIOGENESIS
  • SIGNAL TRANSDUCTION
  • CELL RESCUE, DEFENSE, CELL DEATH AND AGING
  • IONIC HOMEOSTASIS
  • CELLULAR ORGANIZATION
  • RETROTRANSPOSONS AND PLASMID PROTEINS
  • CLASSIFICATION NOT YET CLEAR CUT
  • UNCLASSIFIED PROTEINS

37
text file for annotation
38
4. Automated annotation
  • submission to PEDANT (Protein Extraction,
    Description, and Analysis Tool) calculation
  • developed by Dmitrij Frishman
  • additional annotation of potential protein
    function and structure
  • WEB display

39
Databases for finding functional assignment
  • PIR-International (PIR keywords, PIR families
    and superfamilies)
  • SWISSPROT
  • PROSITE: protein sites and patterns
  • InterPro (PROSITE, Pfam, ProDom, PRINTS)
  • BLOCKS: aligned ungapped segments of highly
    conserved regions
  • Pfam: collection of common protein domains
  • COGS: Clusters of Orthologous Groups
  • PubMed
Enzyme databases
  • Expasy (Swissprot) Enzyme nomenclature database,
    with crosslinks to WIT (Metabolic reconstruction)
    (Pathway) and to Kyoto University Ligand Chemical
    Database (Pathway)
Databases for finding structural assignment
  • PDB (experimentally determined three-dimensional
    structures)
  • SCOP: Structural Classification of Proteins
43
PIR fam
Pfam dom
BLOCKS BLAST
45
MATDB: MIPS Arabidopsis thaliana database
46
CYGD (Comprehensive Yeast Genome Database)
YPD (Yeast Proteome Database)
SGD (Saccharomyces Genome Database)
49
CYGD - annotation
50
Protein/Protein interactions and functional
networks
M. Fellenberg, MIPS
51
Mapping interactions to functions
mRNA processing cluster
thin boxes: different functions
light grey: transcription
dark grey: splicing
grey areas: processing complex
M. Fellenberg et al., MIPS
52
More information on yet unknown functions
Orphan YNR053c interacts with 5 proteins of
known function. These interact with many
proteins of unknown function.
M. Fellenberg et al., MIPS
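
The idea on this slide, reading hints about an unknown protein from the functions of its interaction partners, can be sketched as a simple tally. This is an illustration only and not the method of M. Fellenberg et al.; the protein names and function assignments in the example are placeholders invented for the sketch.

```python
from collections import Counter

def partner_functions(protein: str, interactions, functions: dict) -> Counter:
    """Tally the known functions of a protein's direct interaction partners."""
    partners = set()
    for a, b in interactions:
        if a == protein:
            partners.add(b)
        elif b == protein:
            partners.add(a)
    # Partners without an entry in `functions` are of unknown function and are skipped.
    return Counter(functions[p] for p in partners if p in functions)

# Placeholder data: "orphan" and P1-P4 are hypothetical names.
interactions = [("orphan", "P1"), ("P2", "orphan"), ("orphan", "P4"), ("P1", "P3")]
functions = {"P1": "TRANSCRIPTION", "P2": "TRANSCRIPTION", "P3": "ENERGY"}
tally = partner_functions("orphan", interactions, functions)
```

The most frequent category among the partners is a hint for the orphan's function, not an annotation; it still has to be checked by a curator.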
53
Functional Analysis Data
  • data sets from large scale experiments
  • data from expression analysis (microarrays)
  • data from proteomics
  • data from ??
  • how can they be integrated into annotated
    genomes to be used best by scientists?

55
Expression profiles
56
THM
  • Sequencing complete genomes is now a fast process
  • Automated gene prediction with specifically
    trained prediction programs is possible, but
    error-prone
  • Manual supervision and correction are necessary
  • Basic manual annotation in genome projects
    ensures quality, if certain quality standards
    are observed
  • Manual annotation is time-consuming and costly,
    but there is a payoff!
  • Extensive annotation increases the value of a
    genome database remarkably (the A. thaliana
    database has about 10 000-20 000 accesses per day,
    the yeast database about 10 000)
  • Large experimental data sets have to be
    integrated into the annotation process and added
    even after a genome project is finished
  • Integrating as many experimental results as
    possible is very important and will in future
    depend on the collaboration of the scientific
    community.